How to reshape input for keras LSTM?


I have a numpy array of about 5000 rows and 4 columns (temp, pressure, speed, cost), so its shape is (5000, 4). Each row is an observation taken at a regular interval. This is the first time I'm doing time series prediction and I'm stuck on the input shape. I'm trying to predict a value 1 timestep after the last data point. How do I reshape the data into the 3D form that an LSTM model in Keras expects?
A small sample program would also be very helpful. There doesn't seem to be any example or tutorial where the input has more than one feature (and isn't NLP).

The first question you should ask yourself is:
Over what timescale do the input features encode relevant information for the value you want to predict?
Let's call this timescale prediction_context.
You can now create your dataset:
import numpy as np
recording_length = 5000
n_features = 4
prediction_context = 10  # Change here
# The data you already have
X_data = np.random.random((recording_length, n_features))
to_predict = np.random.random((recording_length, 1))
# Make lists of training examples
X_in = []
Y_out = []
# Append examples to the lists (input and expected output)
for i in range(recording_length - prediction_context):
    X_in.append(X_data[i:i+prediction_context, :])
    Y_out.append(to_predict[i+prediction_context])
# Convert them to numpy arrays
X_train = np.array(X_in)
Y_train = np.array(Y_out)
At the end, X_train.shape is (recording_length - prediction_context, prediction_context, n_features) and Y_train.shape is (recording_length - prediction_context, 1).
So you will need to make a trade-off between the length of your prediction context and the number of examples you will have to train your network.
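As a cross-check on the loop above, here is a minimal sketch of the same windowing done without an explicit loop. It assumes NumPy 1.20 or newer (for sliding_window_view) and uses random stand-in data rather than real measurements; the variable names simply mirror the ones used above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

recording_length = 5000
n_features = 4
prediction_context = 10

# Stand-in data (assumption): replace with your real arrays
rng = np.random.default_rng(0)
X_data = rng.random((recording_length, n_features))
to_predict = rng.random((recording_length, 1))

# sliding_window_view appends the window axis last:
# shape is (n_windows, n_features, prediction_context)
windows = sliding_window_view(X_data, window_shape=prediction_context, axis=0)
# Move the time axis to the middle: (n_windows, prediction_context, n_features)
windows = windows.transpose(0, 2, 1)

# The last window has no following value to predict, so drop it
X_train = windows[:-1]
Y_train = to_predict[prediction_context:]

print(X_train.shape)  # (4990, 10, 4)
print(Y_train.shape)  # (4990, 1)
```

With data in this layout, the first Keras LSTM layer would be given input_shape=(prediction_context, n_features).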

Related

How to shape input array appropriately for ML

I apologize in advance if this question seems silly, but I am trying to understand a machinelearningmastery blog post about LSTM-type ML algorithms, more specifically how the input data gets reshaped. I don't have a ton of experience with the subject, or a CS degree for that matter.
About halfway into the blog post, in the LSTM CNN section, Jason talks about:
The first step is to split the input sequences into subsequences that
can be processed by the CNN model. For example, we can first split our
univariate time series data into input/output samples with four steps
as input and one as output. Each sample can then be split into two
sub-samples, each with two time steps. The CNN can interpret each
subsequence of two time steps and provide a time series of
interpretations of the subsequences to the LSTM model to process as
input.
We can parameterize this and define the number of subsequences as
n_seq and the number of time steps per subsequence as n_steps. The
input data can then be reshaped to have the required structure:
[samples, subsequences, timesteps, features]
My question is: is it a requirement for the data to be shaped into only 4 steps, or can it be larger? The code below attempts to print the array; I am using my own data sample here, from my git account.
import pandas as pd
import numpy as np
# univariate data preparation
from numpy import array
df = pd.read_csv("trainData.csv")
df = df[['kW']].shift(-1)
df = df.dropna()
raw_seq = df.values.tolist()
# split a univariate sequence into samples
def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the sequence
        if end_ix > len(sequence)-1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)
# define input sequence
#raw_seq = [10, 20, 30, 40, 50, 60, 70, 80, 90]
# choose a number of time steps
n_steps = 4
# split into samples
X, y = split_sequence(raw_seq, n_steps)
# reshape from [samples, timesteps] into [samples, subsequences, timesteps, features]
n_features = 1
n_seq = 2
n_steps = 2
X = X.reshape((X.shape[0], n_seq, n_steps, n_features))
# summarize the data
for i in range(len(X)):
    print(X[i], y[i])
The code above works but when I change n_steps = 7 (from 4) I get this shape error.
File "convArray.py", line 39, in <module>
X = X.reshape((X.shape[0], n_seq, n_steps, n_features))
ValueError: cannot reshape array of size 2499 into shape (357,2,2,1)
The reason I want to try and use 7 time steps is the data that I am experimenting with is electrical demand units for a building per day, and 7 days in a week would be an ideal experimental time step!
Any tips greatly appreciated

The error message states the problem: you are trying to reshape the array into a shape it cannot fit. The number of elements in an array equals the product of the numbers in its shape, and a reshape must keep that product unchanged. With a split window of 7, X has shape (357, 7, 1), so it holds 357*7*1=2499 elements, whereas the target shape (357,2,2,1) holds only 357*2*2*1=1428.
The target shape is too small because n_seq = 2 and the second n_steps = 2 still describe a window of 2*2=4 steps. The reshape only works when n_seq multiplied by the number of steps per subsequence equals the window length passed to split_sequence(). That is why a window of 4 works and a window of 7 does not.
Since 7 is prime, a window of 7 can only be split as n_seq = 1 with 7 steps, or n_seq = 7 with 1 step. To keep weekly subsequences, you could instead use a window of 14 with n_seq = 2 and 7 steps each.
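To make the element-count constraint concrete, here is a small sketch under stated assumptions: the data is a stand-in np.arange series rather than the asker's CSV, the helper is a simplified variant of split_sequence, and the split window is chosen as n_seq * n_steps so that the 4D reshape is always valid.

```python
import numpy as np

def split_sequence(sequence, window):
    """Split a 1-D sequence into (window of inputs, next value) samples."""
    X, y = [], []
    for i in range(len(sequence) - window):
        X.append(sequence[i:i + window])
        y.append(sequence[i + window])
    return np.array(X), np.array(y)

# Stand-in data (assumption), one value per day
raw_seq = np.arange(100, dtype=float)

n_features = 1
n_seq, n_steps = 2, 7      # two subsequences of seven days each
window = n_seq * n_steps   # the split window must equal n_seq * n_steps

X, y = split_sequence(raw_seq, window)
# [samples, timesteps] -> [samples, subsequences, timesteps, features]
X = X.reshape((X.shape[0], n_seq, n_steps, n_features))

print(X.shape)  # (86, 2, 7, 1)
print(y.shape)  # (86,)
```

Deriving the window from n_seq and n_steps, instead of setting it separately, removes the chance of the two getting out of sync.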

CNN with vector output and 2D image graph input (input is an array)

I am trying to create a CNN in Keras (Python 3.7) which ingests a 2D matrix input (much like a grayscale image) and outputs a 1 dimensional vector. So far I did manage to get results, but I am not sure if what I am doing is correct (or if my intuition is).
I input a 100x50 array into my convolutional layer. This 2D array holds the peak information at every position (ie. x axis pertains to the position, y-axis pertains to the frequency, and each cell gives the intensity). The 3D graph of this shows something akin to the one given in this link.
From all of the literature I have read, I learned that a CNN accepts image data: the image is converted into pixel values and then repeatedly convolved and pooled to get the output. However, I am using a MATLAB simulator to get my input data, and I have access to the raw 2D array containing the peak frequency information at each point.
My intuition is this: if we normalize each cell and feed the information to the CNN, it will be as if I fed the normalized pixel values of the image to the CNN, since my raw 2D array also has height, width and depth=1, like an image.
Please enlighten me if my thinking is correct or wrong.
My code is as follows:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf
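import keras
from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, BatchNormalization, Reshape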
'''load sample input'''
BGS1 = pd.read_csv("C:/Users/strain1_input.csv")
BGS2 = pd.read_csv("C:/Users/strain2_input.csv")
BGS3 = pd.read_csv("C:/Users/strain3_input.csv")
BGS_ = np.array([BGS1, BGS2, BGS3]) #3x100x50 array
BGS_normalized = BGS_/np.amax(BGS_)
'''load sample output'''
BFS1 = pd.read_csv("C:/Users/strain1_output.csv")
BFS2 = pd.read_csv("C:/Users/strain2_output.csv")
BFS3 = pd.read_csv("C:/Users/strain3_output.csv")
BFS_ = np.array([BFS1, BFS2, BFS3]) #3x100
BFS_normalized = BFS_/50 #since max value for each cell is 50
#after splitting data into training, validation and testing sets,
output_nodes = 100
n_classes = 1
batch_size_ = 8 #so far, optimized for 8 batch size
epoch = 100
input_layer = Input(shape=(45,300,1))
conv1 = Conv2D(16,3,padding="same",activation="relu", input_shape=(45,300,1))(input_layer)
pool1 = MaxPooling2D(pool_size=(2,2),padding="same")(conv1)
flat = Flatten()(pool1)
hidden1 = Dense(10, activation='softmax')(flat) #relu
batchnorm1 = BatchNormalization()(hidden1)
output_layer = Dense(output_nodes*n_classes, activation="softmax")(batchnorm1)
output_layer2 = Dense(output_nodes*n_classes, activation="relu")(output_layer)
output_reshape = Reshape((output_nodes, n_classes))(output_layer2)
model = Model(inputs=input_layer, outputs=output_reshape)
print(model.summary())
model.compile(loss='mean_squared_error', optimizer='adam', sample_weight_mode='temporal')
model.fit(train_X,train_label,batch_size=batch_size_,epochs=epoch)
predictions = model.predict(train_X)

What you did is exactly the strategy used to feed non-image data into 2D convolutional layers. As long as the model predicts correctly, the approach is fine. Just be aware that CNNs can perform poorly on non-image data, and there is a risk of overfitting. But then again, as long as it performs well on data it has not seen, it's good.
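For reference, a minimal sketch of preparing such 2D arrays for a 2D convolutional layer, which expects a trailing channel axis. The data here is random stand-in data, not the asker's simulator output, and the 100x50 size is taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for three 100x50 intensity maps (assumption)
maps = rng.random((3, 100, 50)) * 255.0

# Scale to [0, 1] using the global maximum, as in the question
maps_normalized = maps / np.amax(maps)

# Add a trailing channel axis so each sample looks like a grayscale image
cnn_input = maps_normalized[..., np.newaxis]

print(cnn_input.shape)  # (3, 100, 50, 1)
```

The Input layer's shape would then need to match the per-sample shape, here (100, 50, 1).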

How to test on new values in LSTM in python

I have just created an LSTM model that predicts a multidimensional numpy array using a time frame of 7. The column index starts from 1, because the 0th column is actually a date value. My model does pretty well on the test set up to March 2018, for which I have ground truth values. Now I want to predict the next year. I am stuck on this prediction part because I don't have ground truth to feed into the model; I only have the dates that follow. Could you please help me work out how this prediction can be achieved? Let me know if you need any more details other than data.
Please find the below code
def build_model(NanWah):
    NanWah_data_model1=NanWah
    list_range=int(NanWah_data_model1.shape[0]*0.8)
    rest_list_range=(NanWah_data_model1.shape[0]-list_range)
    NanWah_training_set=NanWah_data_model1.iloc[:list_range,1:].values
    from sklearn.preprocessing import MinMaxScaler
    sc=MinMaxScaler(feature_range=(0,1)) # 0 and 1 scaling it
    NanWah_training_set_scaled=sc.fit_transform(NanWah_training_set)
    X_train=[]
    y_train=[]
    for i in range(7,list_range):
        X_train.append(NanWah_training_set_scaled[i-7:i,:])
        y_train.append(NanWah_training_set_scaled[i])
    X_train,y_train=np.array(X_train),np.array(y_train)
    X_train=np.reshape(X_train,(X_train.shape[0],X_train.shape[1],13))
    from keras.models import Sequential
    from keras.layers import LSTM,Dense,Dropout,Activation
    regressor=Sequential()
    regressor.add(LSTM(units=50,return_sequences=True,input_shape=(X_train.shape[1],13)))
    #regressor.add(Dropout(0.20))
    regressor.add(LSTM(units=50,return_sequences=True))
    #regressor.add(Dropout(0.20))
    regressor.add(LSTM(units=50,return_sequences=True))
    #regressor.add(Dropout(0.20))
    regressor.add(LSTM(units=50,return_sequences=True))
    #regressor.add(Dropout(0.20))
    regressor.add(LSTM(units=50,return_sequences=False))
    regressor.add(Dense(units=13))
    regressor.compile(optimizer="adam",loss="mean_squared_error")
    regressor.fit(X_train,y_train,epochs=5,batch_size=10)
    NanWah_test_set=NanWah_data_model1.iloc[list_range:,1:].values
    inputs=NanWah_test_set
    inputs=sc.transform(inputs)
    X_test=[]
    for i in range(7,rest_list_range):
        X_test.append(inputs[i-7:i,:])
    X_test=np.array(X_test)
    X_test=np.reshape(X_test,(X_test.shape[0],X_test.shape[1],13))
    predicted_values=regressor.predict(X_test)
    predicted_values=sc.inverse_transform(predicted_values)
    predicted_water_m3=predicted_values[:,9:10]
    predicted_electricity_kwh=predicted_values[:,7:8]
Thank you in advance

I found the answer to this question.
Here is what I did:
1. Keep a numpy array holding the last n values (where n is the number of time steps the LSTM looks back).
2. Append the 0th element of the array to its end as a placeholder.
3. Reshape it.
4. Predict the next value using the trained model.
5. Delete the last element (the placeholder) and append the predicted value in its place.
6. Repeat until you have the desired number of predicted values.
Sample code is given below:
inputs=Test.values # this contains the last 60 values of the training record
inputs = inputs.reshape(-1,1)
# Scale inputs but not actual test values
inputs = sc.transform(inputs)
# I am keeping it for 60 look backs and finding only 5 records
for test in range(0,5):
    inputs=np.append(inputs,inputs[0])
    inputs=inputs.reshape(-1,1)
    print(inputs.shape)
    X_test=[]
    for i in range(60, 61):
        X_test.append(inputs[test:i+test,0])
    # make list to array
    X_test = np.array(X_test)
    print(X_test)
    X_test = np.reshape(X_test,(X_test.shape[0], X_test.shape[1],1))
    predicted_stock_price = regressor.predict(X_test)
    print("for the first iteration {}".format(predicted_stock_price))
    inputs=np.delete(inputs,len(inputs)-1,axis=0)
    inputs=np.append(inputs,predicted_stock_price)
    inputs=inputs.reshape(-1,1)
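The same idea can be written more compactly as a sliding window that feeds each prediction back in as input. The sketch below is only an illustration: predict_next is a hypothetical stand-in for regressor.predict (it just returns the window mean), and the history is made-up data, so that the loop can run without a trained model.

```python
import numpy as np

look_back = 60   # number of past values the model sees
n_future = 5     # number of values to forecast

def predict_next(window):
    """Hypothetical stand-in for the trained model: returns the window mean."""
    return float(window.mean())

# Stand-in for the last 60 scaled values of the training record (assumption)
history = np.linspace(0.0, 1.0, look_back)

window = history.copy()
forecasts = []
for _ in range(n_future):
    # A Keras LSTM would be given window.reshape(1, look_back, 1) here
    next_value = predict_next(window)
    forecasts.append(next_value)
    # Slide the window: drop the oldest value, append the new prediction
    window = np.append(window[1:], next_value)

forecasts = np.array(forecasts)
print(forecasts.shape)  # (5,)
```

Because every step consumes earlier predictions, errors tend to accumulate the further ahead you forecast.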

Reshape Keras Input for LSTM

I have two ndarrays, inputs and results, both consisting of multiple arrays looking like this:
inputs = [
    [[1,2],[2,2],[3,2]],
    [[2,1],[1,2],[2,3]],
    [[2,2],[1,1],[3,3]],
    ...
]
results = [
    [3,4,5],
    [3,3,5],
    [4,2,6],
    ...
]
I managed to split them up into train and test arrays, where train contains 66% of the arrays and test the other 33%. Now I'd like to reshape them for further use in my LSTM but my script fails when inputting them into np.reshape() function.
split = int(round(0.66 * results.shape[0]))
train_results = results[:split, :]
train_inputs = inputs[:split, :]
test_results = results[split:, :]
test_inputs = inputs[split:, :]
X_train = np.reshape(train_inputs, (train_inputs.shape[0], train_inputs.shape[1], 1))
X_test = np.reshape(test_inputs, (test_inputs.shape[0], test_inputs.shape[1], 1))
Please tell me how to use np.reshape() correctly in this case.
Basically I am loosely following this tutorial: https://github.com/Vict0rSch/deep_learning/tree/master/keras/recurrent

You just pass a tuple to np.reshape.
For an LSTM layer, you need the shape like (NumberOfExamples, TimeSteps, FeaturesPerStep).
So, we need to know how many steps your sequence has. By the looks of your X array, I'll suppose you have 3 steps and 2 features.
If that's the case:
X_train = train_inputs.reshape((split,3,2))
X_test = test_inputs.reshape((test_inputs.shape[0], 3, 2))
If, instead, you want 6 steps of one feature, the shape is (split,6,1). You can choose any shape, as long as the product of the three numbers in the shape stays the same as the total number of elements.
As for the results: do you want them to be a sequence matching the input steps? Or are they just independent outputs for the entire sequence?
Since you've got 3 results, and I have assumed you have 3 time steps, I'll assume these 3 results are in sequence as well, so, I'll reshape them as:
Y_train = train_results.reshape((split,3,1)) #three steps, one result per step
#for this to work, your last LSTM layer should use `return_sequences=True`.
But if they are 3 independent results:
Y_train = train_results.reshape((split,3))
#for this to work, the last layer must have 3 units, be it a Dense or an LSTM. If it is an LSTM, it must use `return_sequences=False`.
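A quick way to check these shapes is to run the reshapes on the three sample rows from the question. This is only a sketch on that tiny sample; the variable names X_3x2, X_6x1, Y_seq and Y_flat are made up for illustration.

```python
import numpy as np

# The three sample rows shown in the question
inputs = np.array([
    [[1, 2], [2, 2], [3, 2]],
    [[2, 1], [1, 2], [2, 3]],
    [[2, 2], [1, 1], [3, 3]],
])
results = np.array([
    [3, 4, 5],
    [3, 3, 5],
    [4, 2, 6],
])

n = inputs.shape[0]

# 3 time steps with 2 features each (the array already has this shape)
X_3x2 = inputs.reshape((n, 3, 2))
# 6 time steps with 1 feature each: same 6 numbers per example
X_6x1 = inputs.reshape((n, 6, 1))

# One result per time step (pairs with return_sequences=True)
Y_seq = results.reshape((n, 3, 1))
# Three independent outputs per example (pairs with return_sequences=False)
Y_flat = results.reshape((n, 3))

print(X_3x2.shape, X_6x1.shape)  # (3, 3, 2) (3, 6, 1)
print(Y_seq.shape, Y_flat.shape)  # (3, 3, 1) (3, 3)
```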

3darray training/testing TensorFlow RNN LSTM

(I am testing my abilities to write short but effective questions so let me know how I do here)
I am trying to train/test a TensorFlow recurrent neural network, specifically an LSTM, with some trials of time-series data in the following ndarray format:
[[[time_step_trial_0, feature, feature, ...]
[time_step_trial_0, feature, feature, ...]]
[[time_step_trial_1, feature, feature, ...]
[time_step_trial_1, feature, feature, ...]]
[[time_step_trial_2, feature, feature, ...]
[time_step_trial_2, feature, feature, ...]]]
The 1d portion of this 3darray holds a time step and all feature values that were observed at that time step. The 2d block contains all 1d arrays (time steps) that were observed in one trial. The 3d block contains all 2d blocks (trials) recorded for the time-series dataset. For each trial, the time step frequency is constant and the window interval is the same across all trials (0 to 50 seconds, 0 to 50 seconds, etc.).
For example, I am given data for Formula 1 race cars such as torque, speed, acceleration, rotational velocity, etc. Over a certain time interval, recording time steps every 0.5 seconds, I form 1d arrays with each time step and the features recorded at that time step. Then I form a 2D array around all time steps corresponding to one Formula 1 race car's run on the track. I create a final 3D array holding all F1 cars and their time-series data. I want to train and test a model to detect anomalies in the F1 common trajectories on the course for new cars.
I am currently aware that the TensorFlow models support 2d arrays for training and testing. I was wondering what procedure I would have to follow in order to be able to train and test the model on all of the independent trials (2d) contained in this 3darray. In addition, I will be adding more trials in the future, so what is the proper procedure for continually updating my model with the new data/trials to strengthen my LSTM?
Here is the model I was initially trying to replicate for a purpose other than human activity: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition. Another, more feasible model that I would much rather look at for anomaly detection in time-series data is this one: https://arxiv.org/abs/1607.00148. I want to build an anomaly detection model that, given a set of non-anomalous time-series training data, can detect anomalies in the test data, where parts of the data over time are defined as "out of family."

I think for most LSTM's you're going to want to think of your data in this way (as it will be easy to use as input for the networks).
You'll have 3 dimension measurements:
feature_size = the number of different features (torque, velocity, etc.)
number_of_time_steps = the number of time steps collected for a single car
number_of_cars = the number of cars
It will most likely be easiest to read your data in as a set of matrices, where each matrix corresponds to one full sample (all the time steps for a single car).
You can arrange these matrices so that each row is an observation and each column is a different parameter (or the opposite, you may have to transpose the matrices, look at how your network input is formatted).
So each matrix is of size:
number_of_time_steps x feature_size (#rows x #columns). You will have number_of_cars different matrices. Each matrix is a sample.
To convert your array to this format, you can use this block of code (note that you can already access a single sample in your array with A[n], but this ensures the accessed elements have the shape you expect):
import numpy as np
A = [[['car1', 'timefeatures1'],['car1', 'timefeatures2']],
     [['car2', 'timefeatures1'],['car2', 'timefeatures2']],
     [['car3', 'timefeatures1'],['car3', 'timefeatures2']]
    ]
easy_format = np.array(A)
Now you can get an individual sample with easy_format[n], where n is the sample you want.
easy_format[1] prints
array([['car2', 'timefeatures1'],
['car2', 'timefeatures2']],
dtype='|S12')
easy_format[1].shape = (2,2)
Now that you can do that, you can format them however you need for the network you're using (transposing rows and columns if necessary, presenting a single sample at a time or all of them at once, etc.)
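As an illustration of that layout, here is a minimal sketch that stacks one matrix per car into a single 3D array and, if a network wants the time axis first, transposes it. The sizes and the random data are assumptions chosen for the example, not values from the question.

```python
import numpy as np

number_of_cars = 3
number_of_time_steps = 100
feature_size = 4

rng = np.random.default_rng(0)
# One (time_steps x features) matrix per car; stand-in data (assumption)
trials = [rng.random((number_of_time_steps, feature_size))
          for _ in range(number_of_cars)]

# Stack the per-car matrices into one array: (cars, time_steps, features)
data = np.stack(trials, axis=0)
print(data.shape)  # (3, 100, 4)

# Some networks expect time-major input: (time_steps, cars, features)
time_major = data.transpose(1, 0, 2)
print(time_major.shape)  # (100, 3, 4)
```

np.stack requires every trial to have the same number of time steps, which matches the constant window described in the question.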
What you're looking to do (if I'm reading that second paper correctly) most likely requires a sequence to sequence lstm or rnn. Your original sequence is your time series for a given trial, and you're generating an intermediate set of weights (an embedding) that can recreate that original sequence with a low amount of error. You're doing this for all the trials. You will train this lstm on a series of reasonably normal trials and get it to perform well (reconstruct the sequence accurately). You can then use this same set of embeddings to try to reconstruct a new sequence, and if it has a high reconstruction error, you can assume it's anomalous.
Check this repo for a sample of what you'd want along with explanations of how to use it and what the code is doing (it only maps a sequence of integers to another sequence of integers, but can easily be extended to map a sequence of vectors to a sequence of vectors): https://github.com/ichuang/tflearn_seq2seq The pattern you'd define is just your original sequence. You might also take a look at autoencoders for this problem.
Final Edit: Check this repository: https://github.com/beld/Tensorflow-seq2seq-autoencoder/blob/master/simple_seq2seq_autoencoder.py
I have modified the code in it very slightly to work on the newest version of tensorflow and to make some of the variable names clearer. You should be able to modify it to run on your dataset. Right now I'm just having it autoencode a randomly generated array of 1's and 0's. You would do this for a large subset of your data and then see if other data was reconstructed accurately or not (much higher error than average might imply an anomaly).
# Requires TensorFlow 1.x (uses tf.contrib, tf.placeholder and tf.Session)
import numpy as np
import tensorflow as tf
learning_rate = 0.001
training_epochs = 30000
display_step = 100
hidden_state_size = 100
samples = 10
time_steps = 20
step_dims = 5
test_data = np.random.choice([ 0, 1], size=(time_steps, samples, step_dims))
initializer = tf.random_uniform_initializer(-1, 1)
seq_input = tf.placeholder(tf.float32, [time_steps, samples, step_dims])
encoder_inputs = [tf.reshape(seq_input, [-1, step_dims])]
decoder_inputs = ([tf.zeros_like(encoder_inputs[0], name="GO")]
                  + encoder_inputs[:-1])
targets = encoder_inputs
weights = [tf.ones_like(targets_t, dtype=tf.float32) for targets_t in targets]
cell = tf.contrib.rnn.BasicLSTMCell(hidden_state_size)
_, enc_state = tf.contrib.rnn.static_rnn(cell, encoder_inputs, dtype=tf.float32)
cell = tf.contrib.rnn.OutputProjectionWrapper(cell, step_dims)
dec_outputs, dec_state = tf.contrib.legacy_seq2seq.rnn_decoder(decoder_inputs, enc_state, cell)
y_true = [tf.reshape(encoder_input, [-1]) for encoder_input in encoder_inputs]
y_pred = [tf.reshape(dec_output, [-1]) for dec_output in dec_outputs]
# Sum of squared reconstruction errors
loss = 0
for i in range(len(y_true)):
    loss += tf.reduce_sum(tf.square(tf.subtract(y_pred[i], y_true[i])))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    x = test_data
    for epoch in range(training_epochs):
        #x = np.arange(time_steps * samples * step_dims)
        #x = x.reshape((time_steps, samples, step_dims))
        feed = {seq_input: x}
        _, cost_value = sess.run([optimizer, loss], feed_dict=feed)
        if epoch % display_step == 0:
            print("logits")
            a = sess.run(y_pred, feed_dict=feed)
            print(a)
            print("labels")
            b = sess.run(y_true, feed_dict=feed)
            print(b)
            print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(cost_value))
print("Optimization Finished!")

Your input shape and the corresponding model depend on what type of anomaly you want to detect. You can consider:
1. Feature-only anomaly:
Here you consider individual features and decide whether any of them is anomalous, without considering when it was measured. In your example, the feature vector [torque, speed, acceleration,...] is an anomaly if one or more of its values is an outlier with respect to the other features. In this case your inputs should be of the form [batch, features].
2. Time-feature anomaly:
Here your inputs depend on when you measure the feature. Your current feature may depend on the previous features measured over time. For example, there may be a feature whose value is an outlier if it appears at time 0 but not an outlier if it appears further along in time. In this case you divide each of your trials into overlapping time windows and form a feature set of the form [batch, time_window, features].
It should be very simple to start with (1): train an autoencoder and, on the error between its input and output, choose a threshold such as 2 standard deviations from the mean to determine whether a sample is an outlier or not.
For (2), you can follow the second paper you mentioned using a seq2seq model, where your decoder error will determine which features are outliers. You can check on this for the implementation of such a model.
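The thresholding step mentioned for (1) might look roughly like the sketch below. It assumes you already have one reconstruction error per sample from a trained autoencoder; here those errors are simulated with random numbers, and the names train_errors and new_errors are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical reconstruction errors on normal (non-anomalous) training data
train_errors = rng.normal(loc=1.0, scale=0.1, size=1000)

# Flag anything more than 2 standard deviations above the mean error
threshold = train_errors.mean() + 2 * train_errors.std()

# Hypothetical reconstruction errors for three new samples
new_errors = np.array([0.95, 1.05, 2.0])
is_anomaly = new_errors > threshold

print(is_anomaly)  # [False False  True]
```

The factor of 2 is a starting point; it trades false alarms against missed anomalies and would normally be tuned on held-out data.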
