Use RobustScaler in an LSTM with Keras - python

I want to scale time series data with outliers and use it in an LSTM model with Keras.
My code for the scaling is:
from sklearn.preprocessing import RobustScaler
import pandas as pd
import numpy as np

# Train data: fit the scaler on the training set only
scaler = RobustScaler().fit(train)
train = pd.DataFrame(scaler.transform(train))
train = train.values
# Test data: reuse the scaler fitted on the training data
test = pd.DataFrame(scaler.transform(test))
test = test.values
Afterwards, I put the data into 3D format for Keras:
# split a multivariate sequence into samples
def split_sequences(sequences, n_steps):
    X, y = list(), list()
    for i in range(len(sequences)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the dataset
        if end_ix > len(sequences) - 1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequences[i:end_ix, :], sequences[end_ix, :12]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)
# choose a number of time steps
n_steps = 30
# convert into train input/output
X_trai, y_trai = split_sequences(train, n_steps)
print(X_trai.shape, y_trai.shape)
# convert into test input/output
X_test, y_test = split_sequences(test, n_steps)
print(X_test.shape, y_test.shape)
The training and prediction work well; however, I am not able to inverse transform the predicted y data of the test dataset.
My questions:
Is the above scaling method correct?
If yes, how can I regain the original scale of my y_hat predictions to compare them with the original y test dataset?
Thank you!

I am pretty new to this, but this is the way I've been able to do it successfully:
1. Create a transformer for the various features
2. Create a transformer for the target (i.e., the thing you are predicting)
3. Transform the data
4. Create the dataset
5. Create and run the model
6. Use the "inverse_transform" function to bring the data back to its original form
Here is an example of what I am trying to describe.
Note: this does cause a SettingWithCopyWarning that I still need to resolve ("See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy / iloc._setitem_with_indexer(indexer, value, self.name)").
from sklearn.preprocessing import RobustScaler

f_columns = ['A', 'B', 'C', 'D']

f_transformer = RobustScaler()
target_transformer = RobustScaler()
f_transformer = f_transformer.fit(train[f_columns].to_numpy())
target_transformer = target_transformer.fit(train[['target']])
train.loc[:, f_columns] = f_transformer.transform(train[f_columns].to_numpy()).copy()
train['target'] = target_transformer.transform(train[['target']]).copy()
# Steps 4 and 5 (create the dataset, build and run the model) go here
y_trai_inv = target_transformer.inverse_transform(y_trai.reshape(1, -1))
y_test_inv = target_transformer.inverse_transform(y_test.reshape(1, -1))
y_pred_inv = target_transformer.inverse_transform(y_pred)
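If you instead keep the single scaler fitted on the whole multivariate frame (as in the question), one option is to pad the predictions back to the scaler's full column width before inverting, since RobustScaler works column by column. A rough sketch, assuming the targets are the first 12 columns and n_features is the number of columns the scaler was fitted on:
import numpy as np

def invert_target_scaling(scaler, y_hat, n_features, n_targets=12):
    # place the predictions back into a dummy array with the scaler's full width
    padded = np.zeros((y_hat.shape[0], n_features))
    padded[:, :n_targets] = y_hat
    # per-column inverse transform; the zero-filled columns do not affect the target columns
    return scaler.inverse_transform(padded)[:, :n_targets]

# y_hat_orig = invert_target_scaling(scaler, y_hat, n_features=train.shape[1])
The commented call is only illustrative; train.shape[1] assumes train is still the scaled 2D array from the question.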

Related

Create train and test with lags of multiple features

I have a classification problem for which I want to create a train and test dataframe with 21 lags of multiple features (X-variables). I already have an easy way to do this with only one feature but I don't know how to adjust this code if I want to use more variables (e.g. df['ETHLogReturn']).
The code I have for one variable is:
Ntest = 252
train = df.iloc[:-Ntest]
test = df.iloc[-Ntest:]
# Create data ready for the machine learning algorithm
series = df['BTCLogReturn'].to_numpy()[1:] # first change is NaN
# Did the price go up or down?
target = (series > 0) * 1
T = 21 # 21 Lags
X = []
Y = []
for t in range(len(series) - T):
    x = series[t:t+T]
    X.append(x)
    y = target[t+T]
    Y.append(y)
X = np.array(X).reshape(-1,T)
Y = np.array(Y)
N = len(X)
print("X.shape", X.shape, "Y.shape", Y.shape)
#output --> X.shape (8492, 21) Y.shape (8492,)
Then I create my train and test datasets like this:
Xtrain, Ytrain = X[:-Ntest], Y[:-Ntest]
Xtest, Ytest = X[-Ntest:], Y[-Ntest:]
# example of model:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(Xtrain, Ytrain)
print(lr.score(Xtrain, Ytrain))
print(lr.score(Xtest, Ytest))
Does anyone have a suggestion for how to adjust this code for a model with lagged variables from multiple columns? Like:
df[['BTCLogReturn','ETHLogReturn']]
Many thanks for your help!
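One way to extend the single-feature loop to several columns is to slide the same window over a 2D array and flatten each window, so that every row contains T lags of every feature. A rough sketch, assuming the direction of BTCLogReturn is still the target:
import numpy as np

cols = ['BTCLogReturn', 'ETHLogReturn']
series = df[cols].to_numpy()[1:]       # shape (n, n_features); first change is NaN
target = (series[:, 0] > 0) * 1        # direction of the first column (assumption)

T = 21
X, Y = [], []
for t in range(len(series) - T):
    X.append(series[t:t+T].flatten())  # T lags of every feature, flattened into one row
    Y.append(target[t+T])
X = np.array(X).reshape(-1, T * len(cols))
Y = np.array(Y)
The same Ntest-based train/test slicing then works unchanged, since each row still corresponds to one time step.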

Time series data preparation for LSTM classification

I am trying to fit a simple LSTM model to perform binary classification on multivariate time series data. I am confused by the data preparation steps needed to feed the data into the model. Most online materials cover data preparation for prediction (regression) problems with LSTMs; in my case I am trying to prepare data for classification with an LSTM. How can I make sure that I am not breaking the temporal order or the class balance when performing the train and test split? My class labels are highly imbalanced. Is my code correct for LSTM classification? I appreciate your time. Thanks!
Sample data:
sequence_length = 10

def generate_data(X, y, sequence_length=10, step=1):
    X_local = []
    y_local = []
    for start in range(0, len(X) - sequence_length, step):
        end = start + sequence_length
        X_local.append(X[start:end])
        y_local.append(y[end-1])
    return np.array(X_local), np.array(y_local)

X_sequence, y = generate_data(data.loc[:, "V1":"V4"].values, data.Class)
Shape of the sequence:
X_sequence.shape, y.shape
((12237889, 10, 4), (12237889,))
Training and test data split:
training_size = int(len(X_sequence) * 0.7)
X_train, y_train = X_sequence[:training_size], y[:training_size]
X_test, y_test = X_sequence[training_size:], y[training_size:]
Shape of the training and test data:
X_train.shape, X_test.shape
((8566522, 10, 4), (3671367, 10, 4))
y_train.shape, y_test.shape
((8566522,), (3671367,))
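For the imbalance part, one common option is to keep the chronological split exactly as above and pass class weights to the model rather than resampling across time. A rough sketch, assuming binary labels and the shapes printed above:
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

# Weight each class inversely to its frequency in the (chronologically earlier) training data
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(10, 4)),    # (sequence_length, n_features)
    tf.keras.layers.Dense(1, activation='sigmoid'),   # binary classification head
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=5, batch_size=1024, class_weight=class_weight)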

Passing dataframe to keras sequential model

I'm trying to build and train a simple MLP model using keras.Sequential().
However, I'm having issues when, after each training epoch, I try to evaluate the current status of the model on the train and test data.
I'm having this problem on a couple of different datasets; one of them is the "CAR DETAILS FROM CAR DEKHO" dataset, which you can find here
This is what I'm doing so far:
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split


def main():
    ## read, preprocess and split data
    df_data = pd.read_csv('car_data_CAR_DEKHO.csv')
    df_data = pre_process(df_data)
    X_train, y_train, X_test, y_test = split_data(df_data)  ## -> these are PANDAS DATAFRAMES!
    train(X_train, X_test, y_train, y_test)


def train(X_train, X_test, y_train, y_test):
    ##--------------------
    ## building model
    ##--------------------
    batch = 5000
    epochs = 500
    lr = 0.001
    data_iter = load_array((X_train, y_train), batch)
    initializer = tf.initializers.RandomNormal(stddev=0.01)
    net = tf.keras.Sequential()
    net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer))
    loss = tf.keras.losses.MeanSquaredError()
    trainer = tf.keras.optimizers.SGD(learning_rate=lr)
    ##--------------#
    ## training     #
    ##--------------#
    for epoch in range(1, epochs + 1):
        for X_batch, Y_batch in data_iter:
            with tf.GradientTape() as tape:
                l = loss(net(X_batch, training=True), Y_batch)
            grads = tape.gradient(l, net.trainable_variables)
            trainer.apply_gradients(zip(grads, net.trainable_variables))
        # test on train set after epoch
        y_pred_train = net(X_train)  ## ERROR HERE!!!
        l_train = loss(y_pred_train, y_train)
        y_pred_test = net(X_test)
        l_test = loss(y_pred_test, y_test)


def load_array(data_arrays, batch_size, is_train=True):
    """Construct a TensorFlow data iterator."""
    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
    if is_train:
        dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.batch(batch_size)
    return dataset


def split_data(df_data):
    X = df_data.copy()
    y = X.pop('selling_price')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return X_train, y_train, X_test, y_test


def pre_process(df_data):
    ## check NaNs and drop rows if any
    print(df_data.isnull().sum())
    df_data.dropna(inplace=True)
    ## drop weird outlier, turns out it has 1 km_driven
    df_data.drop([1312], inplace=True)
    ## features engineering
    df_data['name'] = df_data['name'].map(lambda x: x.split(' ')[0])
    df_data['owner'] = df_data['owner'].map(lambda x: x.split(' ')[0])
    df_data['selling_price'] = df_data['selling_price']/1000
    df_data_dummies = pd.get_dummies(df_data, drop_first=True)
    df_data_dummies = normalize(df_data_dummies)  ## this is a simple min-max scaling, I do it manually but you can use sklearn or something similar
    return df_data_dummies


def normalize(df):
    print('Data normalization:')
    result = df.copy()
    for feature_name in df.columns:
        if feature_name == 'selling_prize':
            pass
        else:
            max_value = df[feature_name].max()
            min_value = df[feature_name].min()
            result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
            if result[feature_name].isnull().values.any():
                result.drop([feature_name], axis=1, inplace=True)
                print(f'Something wrong in {feature_name}, dropped.')
                print(f'now shape is {len(result)}, {len(result.columns)}')
    print(f'\treturning {len(result)}, {len(result.columns)}')
    return result
and I'm getting the error:
File "/home/lews/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/engine/input_spec.py", line 232, in assert_input_compatibility
ndim = x.shape.rank
AttributeError: 'tuple' object has no attribute 'rank'
I guess the error is due to me passing X_train (which is a dataframe) directly to net.
I also tried using again:
y_pred_train = net(tf.data.Dataset.from_tensor_slices(X_train))
like when creating training batches, but it returns another error:
File "/home/lews/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/engine/input_spec.py", line 201, in assert_input_compatibility
raise TypeError('Inputs to a layer should be tensors. Got: %s' % (x,))
TypeError: Inputs to a layer should be tensors. Got: <TensorSliceDataset shapes: (19,), types: tf.float64>
Finally, I tried using:
y_pred_train = net.predict(X_train)
the weird thing in this case is that I got an OOM error, referring to a tensor with shape[76571,76571]:
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[76571,76571] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:SquaredDifference]
but the X_train dataframe has shape (76571, 19), so I don't understand what is happening.
What is the correct way to do this?
Your code mostly looks OK, the issue must be with the data that you pass.
Check content and datatypes of the data that you feed.
Try converting pandas slices into np.arrays, re-check their dimensions and then feed np.arrays to load_array().
Also try smaller batches, like 64 (not 5000).
UPDATE:
Apparently, when you pass X_batch to the model you pass a tf.Tensor, but later, when you pass the whole X_train or X_test, you pass pd.DataFrames and the model gets confused.
You should change just 2 lines:
y_pred_train = net(tf.constant(X_train)) # pass TF.tensor - best
#alternative:
y_pred_train = net(X_train.values) # pass np.array - also good
y_pred_test = net(tf.constant(X_test)) # make similar change here
The issue looks like it is related to the data (as Poe Dator says). What I believe is going on is that your network has some input shape based on the batches of data it is receiving. Then when you are trying to predict or call your network on the data (this also recomputes shapes since it calls the build() function), it tries to get the data into the shape it expects. I think specifically it is expecting a shape of (batch, 1, 19) but with your data in (76571, 19) it is not finding the correct shape.
A couple of easy steps to work on this would be:
Call net.summary() to see what the shapes it believes it is getting before and after training
Provide the input shape to the first layer, net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer, input_shape=(1, 19)))
Slice your X data in the same shape as your training data.
Add a dimension to your data so it is (76571, 1, 19) to explicitly shape it as well.
Also, as noted above, smaller batch sizes would be best. I would also recommend using model.compile() and model.fit() instead of handling gradients yourself if you don't have a lot of experience with TensorFlow. This saves you code and makes it easier to ensure you are handling your model correctly during training.
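A minimal sketch of that compile/fit route, assuming the dataframes are first converted to NumPy arrays with .values and that the 19 preprocessed feature columns are the model input:
import tensorflow as tf

net = tf.keras.Sequential([
    tf.keras.layers.Dense(1,
                          kernel_initializer=tf.initializers.RandomNormal(stddev=0.01),
                          input_shape=(19,)),   # 19 preprocessed feature columns
])
net.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001), loss='mse')

# Keras handles batching, shuffling and the per-epoch evaluation on validation data
history = net.fit(X_train.values, y_train.values,
                  batch_size=64, epochs=500,
                  validation_data=(X_test.values, y_test.values))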

I wonder if my implementation of the LSTM layer using Keras is right

Here is my model definition:
model = Sequential()
model.add(LSTM(i, input_shape=(None, 1), return_sequences=True))
model.add(Dropout(l))
model.add(LSTM(j))
model.add(Dropout(l))
model.add(Dense(k))
model.add(Dropout(l))
model.add(Dense(1))
and here is result
p = model.predict(x_test)
plt.plot(y_test)
plt.plot(p)
The sequential input represents the past signal in previous time steps, and the output predicts the signal at the next time step. After splitting the training and testing data, the predictions on the test data are as follows:
The figure shows an almost perfect match between the gold test data and the predictions. Is it possible to predict with such high accuracy?
I think something is wrong, because there is no volatility. So I wonder whether it has been implemented properly.
If the implementation is correct, how can I get the following (next) value?
Is it right to implement it like this?
a = x_test[-1]
b = model.predict(a)
c = model.predict(b)
...
To sum up the question:
Is the implementation done the right way?
How do I get the value of the next data point?
def create_dataset(signal_data, look_back=1):
    dataX, dataY = [], []
    for i in range(len(signal_data) - look_back):
        dataX.append(signal_data[i:(i + look_back), 0])
        dataY.append(signal_data[i + look_back, 0])
    return np.array(dataX), np.array(dataY)

train_size = int(len(signal_data) * 0.80)
test_size = len(signal_data) - train_size - int(len(signal_data) * 0.05)
val_size = len(signal_data) - train_size - test_size
train = signal_data[0:train_size]
val = signal_data[train_size:train_size+val_size]
test = signal_data[train_size+val_size:len(signal_data)]
x_train, y_train = create_dataset(train, look_back)
x_val, y_val = create_dataset(val, look_back)
x_test, y_test = create_dataset(test, look_back)
I use create_dataset with look_back=20.
signal_data is preprocessed with min-max normalisation MinMaxScaler(feature_range=(0, 1)).
Is the implementation done the right way?
Your code seems correct, and I don't think the results are surprising. You should compare them with a baseline in which the next prediction is randomly sampled from the range of possible day-to-day changes. That way you can at least tell whether your model is doing better than random sampling.
delta_train = y_train[1:] - y_train[:-1]
delta_range_train = delta_train.max() - delta_train.min()
# generating the baseline based on the change range in training:
random_p = x_test[:, -1] + (np.random.rand(x_test.shape[0]) - 0.5) * delta_range_train
You can check if your results are better than just a random sample random_p.
How do I get the value of the next data point?
this gives you the last data point in the test set:
a = x_test[-1:]
then, here you are predicting the next point:
b = model.predict(a)
based on look_back, you need to keep some of the previous data points in order to predict the next-next point:
c = model.predict(np.array([list(a[0, 1:]) + [b[0, 0]]]))
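Rolling that forward, here is a rough sketch of iterative multi-step forecasting, assuming x_test is the 2D array produced by create_dataset and each prediction is fed back into the window:
import numpy as np

# Sketch: predict n future points one step at a time, feeding each prediction back in.
window = x_test[-1].copy()             # last known window, shape (look_back,)
predictions = []
for _ in range(10):                    # forecast 10 future steps
    p = model.predict(window.reshape(1, -1, 1))[0, 0]   # (1, look_back, 1) for the LSTM
    predictions.append(p)
    window = np.append(window[1:], p)  # drop the oldest value, append the new prediction

# predictions are still in the MinMaxScaler range; invert before comparing to raw data
# predictions = scaler.inverse_transform(np.array(predictions).reshape(-1, 1))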

Fine-tuning a neural network in tensorflow

I've been working on this neural network with the intent to predict TBA (time based availability) of simulated windmill parks based on certain attributes. The neural network runs just fine and gives me some predictions; however, I'm not quite satisfied with the results. It fails to notice some very obvious correlations that I can clearly see by myself. Here is my current code:
# Import
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
maxi = 0.96
mini = 0.7
# Make data a np.array
data = pd.read_csv('datafile_ML_no_avg.csv')
data = data.values
# Shuffle the data
shuffle_indices = np.random.permutation(np.arange(len(data)))
data = data[shuffle_indices]
# Training and test data
data_train = data[0:int(len(data)*0.8),:]
data_test = data[int(len(data)*0.8):int(len(data)),:]
# Scale data
scaler = MinMaxScaler(feature_range=(mini, maxi))
scaler.fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)
# Build X and y
X_train = data_train[:, 0:5]
y_train = data_train[:, 6:7]
X_test = data_test[:, 0:5]
y_test = data_test[:, 6:7]
# Number of stocks in training data
n_args = X_train.shape[1]
multi = int(8)
# Neurons
n_neurons_1 = 8*multi
n_neurons_2 = 4*multi
n_neurons_3 = 2*multi
n_neurons_4 = 1*multi
# Session
net = tf.InteractiveSession()
# Placeholder
X = tf.placeholder(dtype=tf.float32, shape=[None, n_args])
Y = tf.placeholder(dtype=tf.float32, shape=[None,1])
# Initialize1s
sigma = 1
weight_initializer = tf.variance_scaling_initializer(mode="fan_avg",
distribution="uniform", scale=sigma)
bias_initializer = tf.zeros_initializer()
# Hidden weights
W_hidden_1 = tf.Variable(weight_initializer([n_args, n_neurons_1]))
bias_hidden_1 = tf.Variable(bias_initializer([n_neurons_1]))
W_hidden_2 = tf.Variable(weight_initializer([n_neurons_1, n_neurons_2]))
bias_hidden_2 = tf.Variable(bias_initializer([n_neurons_2]))
W_hidden_3 = tf.Variable(weight_initializer([n_neurons_2, n_neurons_3]))
bias_hidden_3 = tf.Variable(bias_initializer([n_neurons_3]))
W_hidden_4 = tf.Variable(weight_initializer([n_neurons_3, n_neurons_4]))
bias_hidden_4 = tf.Variable(bias_initializer([n_neurons_4]))
# Output weights
W_out = tf.Variable(weight_initializer([n_neurons_4, 1]))
bias_out = tf.Variable(bias_initializer([1]))
# Hidden layer
hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, W_hidden_1), bias_hidden_1))
hidden_2 = tf.nn.relu(tf.add(tf.matmul(hidden_1, W_hidden_2),
bias_hidden_2))
hidden_3 = tf.nn.relu(tf.add(tf.matmul(hidden_2, W_hidden_3),
bias_hidden_3))
hidden_4 = tf.nn.relu(tf.add(tf.matmul(hidden_3, W_hidden_4),
bias_hidden_4))
# Output layer (transpose!)
out = tf.transpose(tf.add(tf.matmul(hidden_4, W_out), bias_out))
# Cost function
mse = tf.reduce_mean(tf.squared_difference(out, Y))
# Optimizer
opt = tf.train.AdamOptimizer().minimize(mse)
# Init
net.run(tf.global_variables_initializer())
# Fit neural net
batch_size = 10
mse_train = []
mse_test = []
# Run
epochs = 10
for e in range(epochs):
    # Shuffle training data
    shuffle_indices = np.random.permutation(np.arange(len(y_train)))
    X_train = X_train[shuffle_indices]
    y_train = y_train[shuffle_indices]
    # Minibatch training
    for i in range(0, len(y_train) // batch_size):
        start = i * batch_size
        batch_x = X_train[start:start + batch_size]
        batch_y = y_train[start:start + batch_size]
        # Run optimizer with batch
        net.run(opt, feed_dict={X: batch_x, Y: batch_y})
        # Show progress
        if np.mod(i, 50) == 0:
            mse_train.append(net.run(mse, feed_dict={X: X_train, Y: y_train}))
            mse_test.append(net.run(mse, feed_dict={X: X_test, Y: y_test}))
            pred = net.run(out, feed_dict={X: X_test})
            print(pred)
I have tried tweaking the number of hidden layers, the number of nodes per layer, the number of epochs, and different activation functions and optimizers. However, I am quite new to neural networks, so there might be something very obvious that I'm missing.
Thanks in advance to anyone who managed to read through all of that.
It would make it much easier if you shared a small dataset that illustrates the problem. However, I will state some of the issues with non-standard datasets and how to overcome them.
Possible solutions
Regularization and validation-based optimization - these are always good to try when looking for some extra accuracy. See dropout methods here (original paper), and some overview here. A sketch of how dropout could be added to the code above follows after this list.
Unbalanced data - Sometimes time series categories/events behave like anomalies, or are simply unbalanced. If you read a book, words like "the" or "it" appear far more often than "warehouse". This becomes a problem if your main task is to detect the word "warehouse" and you train your network (even LSTMs) in the traditional way. A way to overcome this is to balance the samples (create balanced datasets) or to give more weight to low-frequency categories.
Model structure - sometimes fully connected layers are not enough. See computer vision problems, for instance, where we train using convolutional layers. The convolutional and pooling layers enforce structure on the model, which suits images. This is also a form of regularization, since those layers have fewer parameters. In time series problems convolutions are also possible, and it turns out they work just fine. See the example in Conditional Time Series Forecasting with Convolutional Neural Networks.
The above suggestions are presented in the order I would suggest to try.
Good luck!
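As a concrete illustration of the first suggestion, here is a rough sketch of adding dropout to one of the hidden layers, using the same TF1-style API as the code above (the keep_prob value is an assumption to tune):
# Dropout on a hidden layer, TF1 style (sketch; values are assumptions)
keep_prob = tf.placeholder_with_default(1.0, shape=())   # 1.0 = no dropout at evaluation time

hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, W_hidden_1), bias_hidden_1))
hidden_1 = tf.nn.dropout(hidden_1, keep_prob=keep_prob)  # randomly zero units during training

# during training, feed e.g. keep_prob: 0.8 along with the batch:
# net.run(opt, feed_dict={X: batch_x, Y: batch_y, keep_prob: 0.8})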
