I am trying to fit a simple LSTM model to perform binary classification on multivariate time series data. I am confused about the data preparation steps needed to feed the series into the model. Most online materials cover data preparation for prediction (a regression problem) with an LSTM; in my case I am preparing data for classification. How do I make sure that I am not breaking the temporal order or the class balance when performing the train/test split? My class labels are highly imbalanced. Is my code correct for LSTM classification? I appreciate your time. Thanks!
Sample data:
sequence_length = 10
def generate_data(X, y, sequence_length=10, step=1):
    X_local = []
    y_local = []
    for start in range(0, len(X) - sequence_length, step):
        end = start + sequence_length
        X_local.append(X[start:end])      # window of `sequence_length` rows of features
        y_local.append(y[end - 1])        # label of the last row in the window
    return np.array(X_local), np.array(y_local)

X_sequence, y = generate_data(data.loc[:, "V1":"V4"].values, data.Class)
Shape of the sequence:
X_sequence.shape, y.shape
((12237889, 10, 4), (12237889,))
Training and test data split:
training_size = int(len(X_sequence) * 0.7)
X_train, y_train = X_sequence[:training_size], y[:training_size]
X_test, y_test = X_sequence[training_size:], y[training_size:]
Shape of the training and test data:
X_train.shape, X_test.shape
((8566522, 10, 4), (3671367, 10, 4))
y_train.shape, y_test.shape
((8566522,), (3671367,))
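The chronological slice above preserves the temporal order, but with highly imbalanced labels it is worth checking how the positives are distributed across the two halves and deriving class weights for training. A minimal sketch (not from the original question), assuming 0/1 labels and the arrays produced above:
import numpy as np

# class balance of each chronological split (labels assumed to be 0/1)
print("positive ratio in train: %.5f" % y_train.mean())
print("positive ratio in test:  %.5f" % y_test.mean())

# inverse-frequency class weights, e.g. for model.fit(..., class_weight=class_weight)
classes, counts = np.unique(y_train, return_counts=True)
class_weight = {int(c): len(y_train) / (len(classes) * n) for c, n in zip(classes, counts)}
print(class_weight)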
Related
I want to scale time series data with outliers and use it in an LSTM model with Keras.
My code for the scaling is:
# Train Data
scaler = RobustScaler().fit(train)
train = pd.DataFrame(scaler.fit_transform(train))
train = train.values
# Test Data
test = pd.DataFrame(scaler.transform(test))
test = test.values
Afterwards, I put the data into 3D format for Keras:
# split a multivariate sequence into samples
def split_sequences(sequences, n_steps):
    X, y = list(), list()
    for i in range(len(sequences)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the dataset
        if end_ix > len(sequences) - 1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequences[i:end_ix, :], sequences[end_ix, :12]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)
# choose a number of time steps
n_steps = 30
# convert into train input/output
X_trai, y_trai = split_sequences(train, n_steps)
print(X_trai.shape, y_trai.shape)
# convert into test input/output
X_test, y_test = split_sequences(test, n_steps)
print(X_test.shape, y_test.shape)
The training and prediction works well, however, I am not able to inverse transform the predicted y data of the test dataset.
My questions:
Is the above scaling method correct?
If yes, how can I regain the original scale of my y_hat predictions to compare them with the original y test dataset?
Thank you!
I am pretty new to this, but this is the way I've been able to do it successfully:
Create a transformer for the various features
Create a transformer for the target (i.e., the thing you are predicting)
Transform the data
Create the dataset
Create and run the model
Use the "inverse_transform" function to bring the data back to its original form
Here is an example of what I am trying to describe, below.
Note: this does raise a pandas "SettingWithCopyWarning" (from iloc._setitem_with_indexer) that I still need to resolve; see the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
from sklearn.preprocessing import RobustScaler
f_columns = ['A',
'B',
'C',
'D'
]
f_transformer = RobustScaler()
target_transformer = RobustScaler()
f_transformer = f_transformer.fit(train[f_columns].to_numpy())
target_transformer = target_transformer.fit(train[['target']])
train.loc[:, f_columns] = f_transformer.transform(train[f_columns].to_numpy()).copy()
train['target'] = target_transformer.transform(train[['target']]).copy()
# steps 4 and 5 (create the dataset, create and run the model) go here
y_trai_inv = target_transformer.inverse_transform(y_trai.reshape(1, -1))
y_test_inv = target_transformer.inverse_transform(y_test.reshape(1, -1))
y_pred_inv = target_transformer.inverse_transform(y_pred)
I'm trying to build and train a simple MLP model using keras.Sequential().
However, I'm having issues when, after each training epoch, I try to evaluate the current status of the model on the train and test data.
I'm having this problem on a couple different datasets, one of them is the "CAR DETAILS FROM CAR DEKHO" dataset, you can find it here
This is what I'm doing so far:
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

def main():
    ## read, preprocess and split data
    df_data = pd.read_csv('car_data_CAR_DEKHO.csv')
    df_data = pre_process(df_data)
    X_train, y_train, X_test, y_test = split_data(df_data)  ## -> these are PANDAS DATAFRAMES!
    train(X_train, X_test, y_train, y_test)
def train(X_train, X_test, y_train, y_test):
    ##--------------------
    ## building model
    ##--------------------
    batch = 5000
    epochs = 500
    lr = 0.001
    data_iter = load_array((X_train, y_train), batch)
    initializer = tf.initializers.RandomNormal(stddev=0.01)
    net = tf.keras.Sequential()
    net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer))
    loss = tf.keras.losses.MeanSquaredError()
    trainer = tf.keras.optimizers.SGD(learning_rate=lr)
    ##--------------------
    ## training
    ##--------------------
    for epoch in range(1, epochs + 1):
        for X_batch, Y_batch in data_iter:
            with tf.GradientTape() as tape:
                l = loss(net(X_batch, training=True), Y_batch)
            grads = tape.gradient(l, net.trainable_variables)
            trainer.apply_gradients(zip(grads, net.trainable_variables))
        # test on train set after epoch
        y_pred_train = net(X_train)  ## ERROR HERE!!!
        l_train = loss(y_pred_train, y_train)
        y_pred_test = net(X_test)
        l_test = loss(y_pred_test, y_test)
def load_array(data_arrays, batch_size, is_train=True):
    """Construct a TensorFlow data iterator."""
    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
    if is_train:
        dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.batch(batch_size)
    return dataset
def split_data(df_data):
    X = df_data.copy()
    y = X.pop('selling_price')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return X_train, y_train, X_test, y_test
def pre_process(df_data):
    ## check NaNs and drop rows if any
    print(df_data.isnull().sum())
    df_data.dropna(inplace=True)
    ## drop weird outlier, turns out it has 1 km_driven
    df_data.drop([1312], inplace=True)
    ## features engineering
    df_data['name'] = df_data['name'].map(lambda x: x.split(' ')[0])
    df_data['owner'] = df_data['owner'].map(lambda x: x.split(' ')[0])
    df_data['selling_price'] = df_data['selling_price'] / 1000
    df_data_dummies = pd.get_dummies(df_data, drop_first=True)
    df_data_dummies = normalize(df_data_dummies)  ## this is a simple min-max scaling, I do it manually but you can use sklearn or something similar
    return df_data_dummies
def normalize(df):
    print('Data normalization:')
    result = df.copy()
    for feature_name in df.columns:
        if feature_name == 'selling_price':  # skip the target column
            pass
        else:
            max_value = df[feature_name].max()
            min_value = df[feature_name].min()
            result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
            if result[feature_name].isnull().values.any():
                result.drop([feature_name], axis=1, inplace=True)
                print(f'Something wrong in {feature_name}, dropped.')
                print(f'now shape is {len(result)}, {len(result.columns)}')
    print(f'\treturning {len(result)}, {len(result.columns)}')
    return result
and I'm getting the error:
File "/home/lews/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/engine/input_spec.py", line 232, in assert_input_compatibility
ndim = x.shape.rank
AttributeError: 'tuple' object has no attribute 'rank'
I guess the error is due to me passing X_train (which is a dataframe) directly to net.
I also tried using:
y_pred_train = net(tf.data.Dataset.from_tensor_slices(X_train))
like when creating training batches, but it returns another error:
File "/home/lews/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/engine/input_spec.py", line 201, in assert_input_compatibility
raise TypeError('Inputs to a layer should be tensors. Got: %s' % (x,))
TypeError: Inputs to a layer should be tensors. Got: <TensorSliceDataset shapes: (19,), types: tf.float64>
Finally, I tried using:
y_pred_train = net.predict(X_train)
The weird thing in this case is that I got an OOM error referring to a tensor of shape [76571, 76571]:
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[76571,76571] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:SquaredDifference]
but the X_train dataframe has shape (76571, 19), so I don't understand what is happening.
What is the correct way to do this?
Your code mostly looks OK; the issue must be with the data that you pass.
Check the content and dtypes of the data that you feed.
Try converting the pandas slices into np.arrays, re-check their dimensions, and then feed the np.arrays to load_array().
Also try smaller batches, like 64 (not 5000).
UPDATE:
Apparently, when you pass X_batch to the model you pass a tf.Tensor, but later, when you pass the whole X_train or X_test, you pass pd.DataFrames, and the model gets confused.
You should change just 2 lines:
y_pred_train = net(tf.constant(X_train)) # pass TF.tensor - best
#alternative:
y_pred_train = net(X_train.values) # pass np.array - also good
y_pred_test = net(tf.constant(X_test)) # make similar change here
The issue looks like it is related to the data (as Poe Dator says). What I believe is going on is that your network infers an input shape from the batches of data it receives. Then, when you try to predict or call your network on other data (this also recomputes shapes, since it calls the build() function), it tries to coerce that data into the shape it expects. I think it is specifically expecting a shape of (batch, 1, 19), but with your data of shape (76571, 19) it is not finding the shape it wants.
A couple of easy steps to work on this would be:
Call net.summary() to see what the shapes it believes it is getting before and after training
Provide the input shape to the first layer, net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer, input_shape=(1, 19)))
Slice your X data in the same shape as your training data.
Add a dimension to your data so it is (76571, 1, 19) to explicitly shape it as well.
Also, as noted above, smaller batch sizes would be best. I would also recommend using the built-in model.fit() training loop instead of handling gradients yourself if you don't have a lot of experience with TensorFlow; it saves you code and makes it easier to be sure you are handling the model correctly during training (a sketch follows below).
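A minimal sketch of that compile/fit route, assuming the 2-D (samples, features) frames from the question (X_train is (76571, 19)); this is not the original poster's code, just one way to wire it up:
import tensorflow as tf

net = tf.keras.Sequential([
    tf.keras.layers.Dense(
        1,
        input_shape=(19,),  # 19 features per sample
        kernel_initializer=tf.initializers.RandomNormal(stddev=0.01)),
])
net.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
            loss=tf.keras.losses.MeanSquaredError())
net.summary()  # confirm the shapes Keras has inferred

# pandas DataFrames are converted to float32 arrays before being fed to the model
net.fit(X_train.values.astype("float32"), y_train.values.astype("float32"),
        validation_data=(X_test.values.astype("float32"), y_test.values.astype("float32")),
        epochs=500, batch_size=64)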
I have a dataset that looks like this:
df.head(5)
   data                                             labels
0  [0.0009808844009380855, 0.0008974465127279559]        1
1  [0.0007158940267629654, 0.0008202958833774329]        3
2  [0.00040971929722210984, 0.000393972522972382]        3
3  [7.916243163372941e-05, 7.401835468434177e-05]        3
4  [8.447556379936086e-05, 8.600626393842705e-05]        3
The 'data' column is my X and the labels is y. The df has 34890 rows. Each row contains 2 floats. The data represents a bunch of sequential text and each observation is a representation of a sentence. There are 5 classes.
I am training it on this LSTM code:
data = df.data.values
labels = pd.get_dummies(df['labels']).values
X_train, X_test, y_train, y_test = train_test_split(data,labels, test_size = 0.10, random_state = 42)
X_train = X_train.reshape((X_train.shape[0],1,X_train.shape[1])) # shape = (31401, 1, 5)
X_test = X_test.reshape((X_test.shape[0],1,X_test.shape[1])) # shape = (3489, 1, 5)
### y_train shape = (31401, 5)
### y_test shape = (3489, 5)
### Bi_LSTM
Bi_LSTM = Sequential()
Bi_LSTM.add(layers.Bidirectional(layers.LSTM(32)))
Bi_LSTM.add(layers.Dropout(.5))
# Bi_LSTM.add(layers.Flatten())
Bi_LSTM.add(Dense(11, activation='softmax'))
def compile_and_fit(history):
    history.compile(optimizer='rmsprop',
                    loss='categorical_crossentropy',
                    metrics=['accuracy'])
    history = history.fit(X_train,
                          y_train,
                          epochs=30,
                          batch_size=32,
                          validation_data=(X_test, y_test))
    return history
LSTM_history = compile_and_fit(Bi_LSTM)
The model trains, but the val accuracy is fixed at 53% for every epoch. I am assuming this is because of my class imbalance problem (1 class takes up ~53% of the data, the other 4 are somewhat evenly distributed throughout the remaining 47%).
How do I balance my data? I am aware of typical over/under sampling techniques on non-time series data, but I can't over/under sample because that would mess with the sequential time-series nature of the data. Any advice?
EDIT
I am attempting to use the class_weight argument in Keras to address this. I am passing this dict into the class_weight argument:
class_weights = {
    0: 1 / len(df[df.labels == 1]),
    1: 1 / len(df[df.labels == 2]),
    2: 1 / len(df[df.labels == 3]),
    3: 1 / len(df[df.labels == 4]),
    4: 1 / len(df[df.labels == 5]),
}
Which I am basing off of this recommendation:
https://stats.stackexchange.com/questions/342170/how-to-train-an-lstm-when-the-sequence-has-imbalanced-classes
However, the acc/loss is now really awful. I get ~30% accuracy with a dense net, so I expected the LSTM to be an improvement. See acc/loss curves below:
Keras/TensorFlow lets you pass class_weight or sample_weight to the model.fit method.
class_weight affects the relative weight of each class in the calculation of the objective function. sample_weight, as the name suggests, allows further control of the relative weight of samples that belong to the same class.
class_weight accepts a dictionary in which you compute the weights of each class, while sample_weight receives a 1-D array of length len(y_train) in which you assign a specific weight to each sample.
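As a minimal sketch (not from the original answer), assuming the compiled Bi_LSTM model and the one-hot X_train/y_train from the question, the two options could look like this:
import numpy as np

# inverse-frequency class weights computed from the one-hot training labels
counts = y_train.sum(axis=0)                                   # samples per class
class_weights = {i: len(y_train) / (len(counts) * c) for i, c in enumerate(counts)}

# option 1: weight whole classes
Bi_LSTM.fit(X_train, y_train,
            epochs=30, batch_size=32,
            validation_data=(X_test, y_test),
            class_weight=class_weights)

# option 2: weight individual samples (here derived from the same class weights)
sample_weight = np.array([class_weights[row.argmax()] for row in y_train])
Bi_LSTM.fit(X_train, y_train,
            epochs=30, batch_size=32,
            validation_data=(X_test, y_test),
            sample_weight=sample_weight)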
Here is my model definition:
model = Sequential()
model.add(LSTM(i, input_shape=(None, 1), return_sequences=True))
model.add(Dropout(l))
model.add(LSTM(j))
model.add(Dropout(l))
model.add(Dense(k))
model.add(Dropout(l))
model.add(Dense(1))
and here is the result:
p = model.predict(x_test)
plt.plot(y_test)
plt.plot(p)
The sequential input represents the past signal over previous time steps, and the output predicts the signal at the next time step. After splitting the training and testing data, the predictions on the test data are as follows:
The figure shows an almost perfect match between the gold test data and the predictions. Is it possible to predict with such high accuracy?
I think something is wrong, because there is no volatility, so I wonder whether it has been implemented properly.
If the implementation is correct, how can I get the following (next) value?
Is it right to implement it like this?
a = x_test[-1]
b = model.predict(a)
c = model.predict(b)
...
To sum up the questions:
Is the implementation done the right way?
How do I get the value of the next data point?
def create_dataset(signal_data, look_back=1):
    dataX, dataY = [], []
    for i in range(len(signal_data) - look_back):
        dataX.append(signal_data[i:(i + look_back), 0])
        dataY.append(signal_data[i + look_back, 0])
    return np.array(dataX), np.array(dataY)
train_size = int(len(signal_data) * 0.80)
test_size = len(signal_data) - train_size - int(len(signal_data) * 0.05)
val_size = len(signal_data) - train_size - test_size
train = signal_data[0:train_size]
val = signal_data[train_size:train_size+val_size]
test = signal_data[train_size+val_size:len(signal_data)]
x_train, y_train = create_dataset(train, look_back)
x_val, y_val = create_dataset(val, look_back)
x_test, y_test = create_dataset(test, look_back)
I use create_dataset with look_back=20.
signal_data is preprocessed with min-max normalisation MinMaxScaler(feature_range=(0, 1)).
Is the implementation done the right way?
Your code seems correct, and I don't think your results are surprising. You need to compare the results against a baseline in which the next prediction is randomly sampled from the range of possible day-to-day change. That way you can at least tell whether your model is doing better than random sampling.
delta_train = train[1][1:] - train[1][:-1]
delta_range_train = delta_train.max()-delta_train.min()
# generating the baseline based on the change range in training:
random_p = test[0][:, -1] + (np.random.rand(test[0].shape[0])-0.5)*delta_range_train
You can check if your results are better than just a random sample random_p.
How do I get the value of the next data point?
This gives you the last input sample in the test set:
a = x_test[-1:]
then, here you are predicting the next point:
b = model.predict(a)
Based on look_back, you need to keep some of the previous data points to predict the next-next point:
c = model.predict(np.array([list(a[0, 1:]) + [b[0, 0]]]))
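Spelling that out as a loop, a hedged sketch of a rolling multi-step forecast might look like the following. It is not code from the original answer; it assumes x_test rows are windows of length look_back as produced by create_dataset, and it adds the (batch, timesteps, features) reshape that the LSTM defined above (input_shape=(None, 1)) expects:
import numpy as np

def rolling_forecast(model, last_window, n_steps, look_back):
    """Feed each prediction back in as the newest input to forecast n_steps ahead."""
    window = np.asarray(last_window, dtype="float32").reshape(-1)[-look_back:]
    preds = []
    for _ in range(n_steps):
        # the model above was built with input_shape=(None, 1), so feed (1, look_back, 1)
        next_val = float(model.predict(window.reshape(1, look_back, 1))[0, 0])
        preds.append(next_val)
        window = np.append(window[1:], next_val)   # drop the oldest point, append the prediction
    return np.array(preds)

future = rolling_forecast(model, x_test[-1], n_steps=10, look_back=20)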
For my master's thesis, I want to predict the price of a stock in the next hour using an LSTM model. My X data contains 30,000 rows with 6 dimensions (= 6 features), and my Y data contains 30,000 rows and only 1 dimension (= the target variable). For my first LSTM model, I reshaped the X data to (30000, 1, 6), the Y data to (30000, 1), and determined the input like this:
input_nn = Input(shape=(1, 6))
I am not sure how to reshape the data and determine the input shape for the model if I want to increase the number of timesteps. I still want to predict the stock price in the next hour, but include more previous time steps.
Do I have to add the data from previous timesteps to my X data in the second dimension?
Can you explain what the number of units of an LSTM exactly refers to? Should it be the same as the number of timesteps in my case?
You are on the right track, but you are confusing the number of units with the number of timesteps. units is a hyperparameter that controls the output dimension of the LSTM: it is the size of the LSTM output vector, so if the input is (1, 6) and you have 32 units, you will get (32,), as the LSTM traverses the single timestep and produces a vector of size 32.
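A tiny sketch (not from the original answer) to confirm that relationship between units and the output shape:
import numpy as np
import tensorflow as tf

x = np.random.rand(8, 1, 6).astype("float32")   # (batch=8, timesteps=1, features=6)
lstm = tf.keras.layers.LSTM(32)                 # 32 units
print(lstm(x).shape)                            # (8, 32): one 32-dimensional vector per sample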
Timesteps refers to the size of the history you want your LSTM to consider, so it isn't the same as units at all. Instead of processing the data yourself, Keras has a handy TimeseriesGenerator which will take 2D data like yours and use a sliding window of some timestep size to generate timeseries data. From the documentation:
from keras.preprocessing.sequence import TimeseriesGenerator
import numpy as np

data = np.array([[i] for i in range(50)])
targets = np.array([[i] for i in range(50)])

data_gen = TimeseriesGenerator(data, targets,
                               length=10, sampling_rate=2,
                               batch_size=2)
assert len(data_gen) == 20

batch_0 = data_gen[0]
x, y = batch_0
assert np.array_equal(x,
                      np.array([[[0], [2], [4], [6], [8]],
                                [[1], [3], [5], [7], [9]]]))
assert np.array_equal(y,
                      np.array([[10], [11]]))
which you can use directly in model.fit_generator(data_gen, ...), giving you the option to try out different sampling rates, timesteps, etc. You should probably investigate these parameters and how they affect the results in your thesis.
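As a hedged illustration (not from the original answer), adapting the generator to the question's shapes, with X of shape (30000, 6) and y of shape (30000, 1) used here as random placeholders:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.preprocessing.sequence import TimeseriesGenerator

X = np.random.rand(30000, 6)        # placeholder for the 6 hourly features
y = np.random.rand(30000, 1)        # placeholder for the next-hour price target

timesteps = 24                      # e.g. look back 24 hours; tune this
gen = TimeseriesGenerator(X, y, length=timesteps, batch_size=32)

model = Sequential([
    LSTM(32, input_shape=(timesteps, 6)),   # 32 units -> 32-dimensional output vector
    Dense(1)                                # one value: the next-hour price
])
model.compile(optimizer="adam", loss="mse")
model.fit_generator(gen, epochs=10)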
Update, with code that is roughly 5 times quicker than the previous version:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

x = np.load(nn_input + "/EOAN" + "/EOAN_X" + ".npy")
y = np.load(nn_input + "/EOAN" + "/EOAN_Y" + ".npy")
num_features = x.shape[1]
num_time_steps = 500

tscv = TimeSeriesSplit()   # assumed: a time-series cross-validator, so folds respect temporal order
fold_counter = 1

for train_index, test_index in tscv.split(x):
    # Split into train and test set
    print("Fold:", fold_counter, "\n" + "Train Index:", train_index, "Test Index:", test_index)
    x_train_raw, y_train, x_test_raw, y_test = x[train_index], y[train_index], x[test_index], y[test_index]

    # Scaling the data (fit on the training fold only, then apply to both)
    scaler = StandardScaler()
    scaler.fit(x_train_raw)
    x_train_raw = scaler.transform(x_train_raw)
    x_test_raw = scaler.transform(x_test_raw)

    # Creating input data with variable timesteps
    x_train = np.zeros((x_train_raw.shape[0] - num_time_steps + 1, num_time_steps, num_features), dtype="float32")
    x_test = np.zeros((x_test_raw.shape[0] - num_time_steps + 1, num_time_steps, num_features), dtype="float32")

    for row in range(len(x_train)):
        for timestep in range(num_time_steps):
            x_train[row][timestep] = x_train_raw[row + timestep]

    for row in range(len(x_test)):
        for timestep in range(num_time_steps):
            x_test[row][timestep] = x_test_raw[row + timestep]

    y_train = y_train[num_time_steps - 1:]
    y_test = y_test[num_time_steps - 1:]

    fold_counter += 1
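If this still turns out to be slow for large arrays, a hedged alternative (not from the original post) is to build the windows with a single vectorized indexing step instead of the two nested Python loops; the result should match the x_train/x_test arrays built above:
import numpy as np

def make_windows(arr_2d, num_time_steps):
    """(n_samples, n_features) -> (n_samples - num_time_steps + 1, num_time_steps, n_features)."""
    n_windows = arr_2d.shape[0] - num_time_steps + 1
    idx = np.arange(num_time_steps)[None, :] + np.arange(n_windows)[:, None]
    return arr_2d[idx].astype("float32")

x_train_vec = make_windows(x_train_raw, num_time_steps)
x_test_vec = make_windows(x_test_raw, num_time_steps)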