Upload one part of dataset at a time in Keras - python

I have a big dataset whose initial dimensions are (453732, 839). The dataset contains subgroups that are related to each other, and those subgroups have variable lengths. Since I have to train an LSTM, each subgroup must be the same length, so I apply padding to each subgroup so that they are all the same length.
After padding, the dataset becomes about 2000000 rows.
So I'm executing model.fit() within a loop, once for each part of the dataset. Inside the loop I pad each portion on the fly before passing it to model.fit(), but on the second part, before model.fit() runs, RAM fills up and I can't continue training.
This is the code in which I pad and fit the model:
training_set_portion_size = int(training_dataset.shape[0] / 6)
start_portion_index = 0
for epoch in range(0, 50):
    for part in range(0, 4):
        end_portion_index = start_portion_index + training_set_portion_size
        training_set_portion = training_dataset[start_portion_index:end_portion_index]
        training_set_portion_labels = training_set_portion[:, training_set_portion.shape[1]-1]
        portion_groups = get_groups_id_count(training_set_portion[:,0])
        # Scale dataset portion
        training_set_portion = scaler.transform(training_set_portion[:,0:training_set_portion.shape[1]-1])
        training_set_portion = np.concatenate((training_set_portion, training_set_portion_labels[:, np.newaxis]), axis=1)
        # Pad dataset portion
        training_set_portion = pad_groups(training_set_portion, portion_groups)
        training_set_portion_labels = training_set_portion[:, training_set_portion.shape[1]-1]
        # Excluding group and label from training_set_portion
        training_set_portion = training_set_portion[:, 1:training_set_portion.shape[1] - 1]
        # Reshape data for LSTM
        training_set_portion = training_set_portion.reshape(int(training_set_portion.shape[0]/timesteps), timesteps, features)
        training_set_portion_labels = training_set_portion_labels.reshape(int(training_set_portion_labels.shape[0]/timesteps), timesteps)
        model.fit(training_set_portion, training_set_portion_labels, validation_split=0.2, shuffle=False, epochs=1,
                  batch_size=1, workers=0, max_queue_size=1, verbose=1)
UPDATE
I'm using pandas now, with chunksize, but it seems the tensors are being concatenated across loop iterations.
pandas iterator:
training_dataset_iterator = pd.read_csv('/content/drive/My Drive/Tesi_magistrale/en-train.H',
                                        chunksize=80000, sep=",", header=None, dtype=np.float64)
New code:
for epoch in range(0, 50):
    for chunk in training_dataset_iterator:
        training_set_portion = chunk.values
        training_set_portion_labels = training_set_portion[:, training_set_portion.shape[1]-1]
        portion_groups = get_groups_id_count(training_set_portion[:,0])
        # Scale dataset portion
        training_set_portion = scaler.transform(training_set_portion[:,0:training_set_portion.shape[1]-1])
        training_set_portion = np.concatenate((training_set_portion, training_set_portion_labels[:, np.newaxis]), axis=1)
        # Pad dataset portion
        print('Padding portion...\n')
        training_set_portion = pad_groups(training_set_portion, portion_groups)
        training_set_portion_labels = training_set_portion[:, training_set_portion.shape[1]-1]
        # Excluding group and label from training_set_portion
        training_set_portion = training_set_portion[:, 1:training_set_portion.shape[1] - 1]
        # Reshape data for LSTM
        training_set_portion = training_set_portion.reshape(int(training_set_portion.shape[0]/timesteps), timesteps, features)
        training_set_portion_labels = training_set_portion_labels.reshape(int(training_set_portion_labels.shape[0]/timesteps), timesteps)
        print('Training set portion shape: ', training_set_portion.shape)
        model.fit(training_set_portion, training_set_portion_labels, validation_split=0.2, shuffle=False, epochs=1,
                  batch_size=1, workers=0, max_queue_size=1, verbose=1)
The first print('Training set portion shape: ', training_set_portion.shape) gave me (21327, 20, 837), but the second gave me (43194, 20, 837). I don't understand why.
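I also realize that the TextFileReader returned by pd.read_csv with chunksize is a one-shot iterator, so after the first epoch it is exhausted and the inner loop would receive no more chunks. If I keep this structure I'd have to re-create it each epoch, roughly like this (a sketch of the same loop as above):

for epoch in range(0, 50):
    # Re-create the iterator every epoch: a pandas TextFileReader yields nothing
    # once it has been fully consumed and cannot be rewound.
    training_dataset_iterator = pd.read_csv('/content/drive/My Drive/Tesi_magistrale/en-train.H',
                                            chunksize=80000, sep=",", header=None, dtype=np.float64)
    for chunk in training_dataset_iterator:
        ...  # same per-chunk scaling, padding and model.fit as above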
UPDATE 2
I noticed that training_set_portion = pad_groups(training_set_portion, portion_groups) somehow duplicates data.
Pad groups code:
def pad_groups(dataset, groups):
    max_subtree_length = 20
    start = 0
    rows, cols = dataset.shape
    padded_dataset = []
    index = 1
    for group in groups:
        pad = [group[0]] + [0] * (cols - 1)
        stop = start + group[1]
        subtree = dataset[start:stop].tolist()
        padded_dataset.extend(subtree)
        subtree_to_pad = max_subtree_length - group[1]
        pads = [pad] * subtree_to_pad
        padded_dataset.extend(pads)
        start = stop
        index += 1
    padded_dataset = np.array(padded_dataset)
    return padded_dataset
How can I fix that?
Thank you in advance.

I found a link on TowardsDataScience where they show you 3 methods to handle this problem using the pandas library, which is widely used for dataset processing. I hope it is of some help in solving your problem. Here is the link:
https://towardsdatascience.com/3-simple-ways-to-handle-large-data-with-pandas-d9164a3c02c1
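For reference, the chunked-reading pattern that article describes looks roughly like this (the file name and the process function are placeholders, not from your code):

import pandas as pd

# Each iteration yields a DataFrame of at most `chunksize` rows instead of
# loading the whole file into memory at once.
for chunk in pd.read_csv('large_dataset.csv', chunksize=100000):
    process(chunk)  # placeholder for per-chunk preprocessing/training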

I solved my issues: there was a bug in my code.

Related

How to include future values in a time series prediction of a RNN in Keras

I currently have an RNN model for time series predictions. It uses 3 input features ("value", "temperature" and "hour of the day") of the last 96 time steps to predict the next 96 time steps of the feature "value".
Here you can see a schema of it:
and here is the current code:
#Import modules
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from tensorflow import keras
# Define the parameters of the RNN and the training
epochs = 1
batch_size = 50
steps_backwards = 96
steps_forward = 96
split_fraction_trainingData = 0.70
split_fraction_validatinData = 0.90
randomSeedNumber = 50
#Read dataset
df = pd.read_csv('C:/Users/Desktop/TestData.csv', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime':[0]}, index_col=['datetime'])
# standardize data
data = df.values
indexWithYLabelsInData = 0
data_X = data[:, 0:3]
data_Y = data[:, indexWithYLabelsInData].reshape(-1, 1)
scaler_standardized_X = StandardScaler()
data_X = scaler_standardized_X.fit_transform(data_X)
data_X = pd.DataFrame(data_X)
scaler_standardized_Y = StandardScaler()
data_Y = scaler_standardized_Y.fit_transform(data_Y)
data_Y = pd.DataFrame(data_Y)
# Prepare the input data for the RNN
series_reshaped_X = np.array([data_X[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])
series_reshaped_Y = np.array([data_Y[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])
timeslot_x_train_end = int(len(series_reshaped_X)* split_fraction_trainingData)
timeslot_x_valid_end = int(len(series_reshaped_X)* split_fraction_validatinData)
X_train = series_reshaped_X[:timeslot_x_train_end, :steps_backwards]
X_valid = series_reshaped_X[timeslot_x_train_end:timeslot_x_valid_end, :steps_backwards]
X_test = series_reshaped_X[timeslot_x_valid_end:, :steps_backwards]
Y_train = series_reshaped_Y[:timeslot_x_train_end, steps_backwards:]
Y_valid = series_reshaped_Y[timeslot_x_train_end:timeslot_x_valid_end, steps_backwards:]
Y_test = series_reshaped_Y[timeslot_x_valid_end:, steps_backwards:]
# Build the model and train it
np.random.seed(randomSeedNumber)
tf.random.set_seed(randomSeedNumber)
model = keras.models.Sequential([
    keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 3]),
    keras.layers.SimpleRNN(10, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(1))
])
model.compile(loss="mean_squared_error", optimizer="adam", metrics=['mean_absolute_percentage_error'])
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_valid, Y_valid))
#Predict the test data
Y_pred = model.predict(X_test)
# Inverse the scaling (traInv: transformation inversed)
data_X_traInv = scaler_standardized_X.inverse_transform(data_X)
data_Y_traInv = scaler_standardized_Y.inverse_transform(data_Y)
series_reshaped_X_notTransformed = np.array([data_X_traInv[i:i + (steps_backwards+steps_forward)].copy() for i in range(len(data) - (steps_backwards+steps_forward))])
X_test_notTranformed = series_reshaped_X_notTransformed[timeslot_x_valid_end:, :steps_backwards]
Y_pred_traInv = scaler_standardized_Y.inverse_transform(Y_pred)
Y_test_traInv = scaler_standardized_Y.inverse_transform(Y_test)
# Calculate errors for every time slot of the multiple predictions
abs_diff = np.abs(Y_pred_traInv - Y_test_traInv)
abs_diff_perPredictedSequence = np.zeros((len(Y_test_traInv)))
average_LoadValue_testData_perPredictedSequence = np.zeros((len(Y_test_traInv)))
abs_diff_perPredictedTimeslot_ForEachSequence = np.zeros((len(Y_test_traInv)))
absoluteError_Load_Ratio_allPredictedSequence = np.zeros((len(Y_test_traInv)))
absoluteError_Load_Ratio_allPredictedTimeslots = np.zeros((len(Y_test_traInv)))
mse_perPredictedSequence = np.zeros((len(Y_test_traInv)))
rmse_perPredictedSequence = np.zeros((len(Y_test_traInv)))
for i in range(0, len(Y_test_traInv)):
    for j in range(0, len(Y_test_traInv[0])):
        abs_diff_perPredictedSequence[i] = abs_diff_perPredictedSequence[i] + abs_diff[i][j]
    mse_perPredictedSequence[i] = mean_squared_error(Y_pred_traInv[i], Y_test_traInv[i])
    rmse_perPredictedSequence[i] = np.sqrt(mse_perPredictedSequence[i])
    abs_diff_perPredictedTimeslot_ForEachSequence[i] = abs_diff_perPredictedSequence[i] / len(Y_test_traInv[0])
    average_LoadValue_testData_perPredictedSequence[i] = np.mean(Y_test_traInv[i])
    absoluteError_Load_Ratio_allPredictedSequence[i] = abs_diff_perPredictedSequence[i] / average_LoadValue_testData_perPredictedSequence[i]
    absoluteError_Load_Ratio_allPredictedTimeslots[i] = abs_diff_perPredictedTimeslot_ForEachSequence[i] / average_LoadValue_testData_perPredictedSequence[i]
rmse_average_allPredictedSequences = np.mean(rmse_perPredictedSequence)
absoluteAverageError_Load_Ratio_allPredictedSequence = np.mean(absoluteError_Load_Ratio_allPredictedSequence)
absoluteAverageError_Load_Ratio_allPredictedTimeslots = np.mean(absoluteError_Load_Ratio_allPredictedTimeslots)
absoluteAverageError_allPredictedSequences = np.mean(abs_diff_perPredictedSequence)
absoluteAverageError_allPredictedTimeslots = np.mean(abs_diff_perPredictedTimeslot_ForEachSequence)
Here you have some test data: Download Test Data
So now I would like to include not only past values of the features in the prediction, but also future values of the features "temperature" and "hour of the day". The future values of the feature "temperature" can, for example, be taken from an external weather forecasting service, and for the feature "hour of the day" the future values are known beforehand (in the test data I have included a "forecast" of the temperature that is not a real forecast; I just randomly changed the values).
This way, I could assume that - for several applications and data - the forecast could be improved.
In a schema it would look like this:
Can anyone tell me how I can do that in Keras with an RNN (or LSTM)? One way could be to include the future values as independent features in the input. But I would like the model to know that the future values of a feature are connected to the past values of that feature.
Reminder: does anybody have an idea how to do this? I'd highly appreciate every comment.
The standard approach is to use an encoder-decoder architecture (see 1 and 2 for instance):
The encoder takes as input the past values of the features and of the target and returns an output representation.
The decoder takes as input the encoder output and the future values of the features and returns the predicted values of the target.
You can use any architecture for the encoder and for the decoder, and you can also consider different approaches for passing the encoder output to the decoder (e.g., adding or concatenating it to the decoder input features, adding or concatenating it to the output of some intermediate decoder layer, or adding it to the final decoder output). The code below is just an example.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Input, Dense, LSTM, TimeDistributed, Concatenate, Add
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
# define the inputs
target = ['value']
features = ['temperatures', 'hour of the day']
sequence_length = 96
# import the data
df = pd.read_csv('TestData.csv', sep=';', header=0, low_memory=False, infer_datetime_format=True, parse_dates={'datetime': [0]}, index_col=['datetime'])
# scale the data
target_scaler = StandardScaler().fit(df[target])
features_scaler = StandardScaler().fit(df[features])
df[target] = target_scaler.transform(df[target])
df[features] = features_scaler.transform(df[features])
# extract the input and output sequences
X_encoder = [] # past features and target values
X_decoder = [] # future features values
y = [] # future target values
for i in range(sequence_length, df.shape[0] - sequence_length):
    X_encoder.append(df[features + target].iloc[i - sequence_length: i])
    X_decoder.append(df[features].iloc[i: i + sequence_length])
    y.append(df[target].iloc[i: i + sequence_length])
X_encoder = np.array(X_encoder)
X_decoder = np.array(X_decoder)
y = np.array(y)
# define the encoder and decoder
def encoder(encoder_features):
    y = LSTM(units=100, return_sequences=True)(encoder_features)
    y = TimeDistributed(Dense(units=1))(y)
    return y

def decoder(decoder_features, encoder_outputs):
    x = Concatenate(axis=-1)([decoder_features, encoder_outputs])
    # x = Add()([decoder_features, encoder_outputs])
    y = TimeDistributed(Dense(units=100, activation='relu'))(x)
    y = TimeDistributed(Dense(units=1))(y)
    return y
# build the model
encoder_features = Input(shape=X_encoder.shape[1:])
decoder_features = Input(shape=X_decoder.shape[1:])
encoder_outputs = encoder(encoder_features)
decoder_outputs = decoder(decoder_features, encoder_outputs)
model = Model([encoder_features, decoder_features], decoder_outputs)
# train the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
model.fit([X_encoder, X_decoder], y, epochs=100, batch_size=128)
# extract the last predicted sequence
y_true = target_scaler.inverse_transform(y[-1, :])
y_pred = target_scaler.inverse_transform(model.predict([X_encoder, X_decoder])[-1, :])
# plot the last predicted sequence
plt.plot(y_true.flatten(), label='actual')
plt.plot(y_pred.flatten(), label='predicted')
plt.show()
In the example above the model takes two inputs, X_encoder and X_decoder, so in your case when generating the forecasts you can use the past observed temperatures in X_encoder and the future temperature forecasts in X_decoder.
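For completeness, a minimal sketch of generating a genuinely new forecast with this model. Here future_temps and future_hours are hypothetical placeholders for the weather-service forecast and the calendar; everything else comes from the code above (note df is already scaled at this point):

import numpy as np

# Encoder input: the last observed window of (already scaled) features + target.
encoder_input = df[features + target].iloc[-sequence_length:].to_numpy()[np.newaxis]
# Decoder input: the known/forecast future feature values, scaled the same way
# as the training features (column order must match `features`).
future_features = np.column_stack([future_temps, future_hours])  # placeholders
decoder_input = features_scaler.transform(future_features)[np.newaxis]
forecast = model.predict([encoder_input, decoder_input])    # shape (1, 96, 1)
forecast = target_scaler.inverse_transform(forecast[0])     # back to original units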
This is PyTorch code for time series prediction with a known external/exogenous regressor over the forecast period. Hope it helps! Have a marvellous day!
The input format is a 3D tensor and the output is a 1D array (MISO: Multiple Inputs, Single Output).
def CNN_Attention_Bidirectional_LSTM_Encoder_Decoder_predictions(model, data, regressors, extrapolations_length):
    n_input = extrapolations_length
    pred_list = []
    batch = data[-n_input:]
    model = model.train()
    pred_list.append(torch.cat((model(batch)[-1], torch.FloatTensor(regressors.iloc[1, [1]]).to(device).unsqueeze(0)), 1))
    batch = torch.cat((batch[n_input-1].unsqueeze(0), pred_list[-1].unsqueeze(0)), 1)
    batch = batch[:, 1:, :]
    for i in range(n_input-1):
        model = model.eval()
        pred_list.append(torch.cat((model(batch).squeeze(0), torch.FloatTensor(regressors.iloc[i+1, [1]]).to(device).unsqueeze(0)), 1))
        batch = torch.cat((batch, pred_list[-1].unsqueeze(0)), 1)
        batch = batch[:, 1:, :]
    model = model.train()
    return np.array([pred_list[j].cpu().detach().numpy() for j in range(n_input)])[:, :, 0]

Aligning batched sliding frame timeseries data for tensorflow/keras using timeseries_dataset_from_array and TimeseriesGenerator respectively

I have multiple input features and a singular target feature that correspond 1:1 to each other's index; meaning there should be no forward-looking or backward-looking when it comes to comparing inputs to targets: input[t] <=> target[t]. Essentially, I have already time-shifted my targets backwards to their corresponding input indexes for training purposes.
Under normal operating procedures, I would use N periods worth of past data in order to predict 1 future value, N periods ahead. As the frame shifts forward in time, each respective slot is filled with the [t+N] forecast, recorded at [t].
Now, based on whatever environment I'm developing in, I will need to use either timeseries_dataset_from_array or TimeseriesGenerator to batch my data (based on system support). I need to know if the implementation I made produces batches that will do what I expect when running model.fit() in keras. I'm unsure of whether or not keras is internally shifting data during fitting that I'm unaware of that might lead to poor results.
I'm using an LSTM, potentially with the stateful argument, so I need to ensure my batches are a perfect fit, and I also wanted to ensure the batch sizes are a power of 2 (according to some posts regarding processor efficiency). I've tried implementing my own function to make this happen, given a few additional assumptions regarding validation/test sizes. On the surface everything looks good, but since I'm unsure of Keras' internals I don't know if I've made a blunder.
My question is whether or not I've properly aligned/batched the inputs and targets using timeseries_dataset_from_array/TimeseriesGenerator such that running model.fit() will train using losses/metrics that compare the target at time [t] with the predicted value at time [t] using inputs at time [t].
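For reference, here is my understanding of the alignment semantics of the two APIs (taken from the Keras docs), which the shifting logic below relies on: timeseries_dataset_from_array pairs the window data[i:i+sequence_length] with targets[i], while TimeseriesGenerator pairs the window data[i:i+length] with targets[i+length]. A minimal demonstration:

import numpy as np
from tensorflow.keras.preprocessing import timeseries_dataset_from_array
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

data = np.arange(6)
# timeseries_dataset_from_array: window data[i:i+3] is paired with targets[i],
# so the first pair is [0 1 2] -> 0.
ds = timeseries_dataset_from_array(data, targets=data, sequence_length=3, batch_size=8)
for x, y in ds:
    print(x.numpy(), y.numpy())
# TimeseriesGenerator: window data[i:i+3] is paired with targets[i+3],
# so the first pair is [0 1 2] -> 3.
gen = TimeseriesGenerator(data, targets=data, length=3, batch_size=8)
print(gen[0])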
import pandas as pd
import numpy as np
use_ts_data = True
try:
    # Comment this line out if you want to test timeseries_dataset_from_array
    raise ImportError("No TDFA for you")
    from tensorflow.keras.preprocessing import timeseries_dataset_from_array as ts_data
except (ModuleNotFoundError, ImportError):
    from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator as ts_gen
    use_ts_data = False

def gp2(size):
    return np.power(2, int(np.log2((size))))
def train_validate_test_split(
    features, targets, train_size_ratio=0.5, max_batch_size=None, memory=1,
):
    def batch_size_with_buffer(buffer, available, desired, max_batch_size):
        batch_size = gp2(min(desired, max_batch_size or np.inf))
        if available < batch_size * 3 + buffer:
            # If we don't have enough records to support this batch_size, use 1 power lower
            batch_size = np.power(2, np.log(batch_size) / np.log(2) - 1)
        return int(batch_size)

    memory = max(1, memory)
    surplus = memory - 1
    test_size_ratio = 1 - train_size_ratio
    total_size = features.shape[0]
    smallest_size = int(total_size * test_size_ratio / 2)

    # Error on insufficient data
    def insufficient_data():
        raise RuntimeError(
            f"Insufficient data on which to split train/validation/test when ratio={train_size_ratio}%, nobs={total_size} and memory={memory}"
        )

    if total_size < memory + 3:
        insufficient_data()

    # Find greatest batch size that is a power of 2, that fits the smallest dataset size, and is no greater than max_batch_size
    batch_size = batch_size_with_buffer(
        surplus, total_size, smallest_size, max_batch_size
    )
    test_size = smallest_size - smallest_size % batch_size

    # Create/align the datasets
    if use_ts_data:
        index_offset = None

        start = -test_size
        X_test = features.iloc[start - surplus:]
        y_test = targets.iloc[start:]

        end = start
        start = end - test_size
        X_validation = features.iloc[start - surplus:end]
        y_validation = targets.iloc[start:end]

        end = start
        start = (total_size + end - surplus) % batch_size
        X_train = features.iloc[start:end]
        y_train = targets.iloc[start + surplus:end]
    else:
        index_offset = memory
        _features = features.shift(-1)

        start = -test_size - memory
        X_test = _features.iloc[start:]
        y_test = targets.iloc[start:]

        end = start + memory
        start = end - test_size - memory
        X_validation = _features.iloc[start:end]
        y_validation = targets.iloc[start:end]

        end = start + memory
        start = (total_size + end - memory) % batch_size
        X_train = _features.iloc[start:end]
        y_train = targets.iloc[start:end]

    # Record indexes
    test_index = y_test.index[index_offset:]
    validation_index = y_validation.index[index_offset:]
    train_index = y_train.index[index_offset:]

    if memory > X_train.shape[0] or memory > X_validation.shape[0]:
        insufficient_data()

    format_data = ts_data if use_ts_data else ts_gen
    train = format_data(X_train.values, y_train.values, memory, batch_size=batch_size)
    validation = format_data(
        X_validation.values, y_validation.values, memory, batch_size=batch_size
    )
    test = format_data(X_test.values, y_test.values, memory, batch_size=batch_size)

    # Print out the batched data for inspection
    def results(dataset, index):
        print("\n-------------------\n")
        print(f"Index:\n\n", index, "\n\n")
        last_i = len(dataset) - 1
        for i, batch in enumerate(dataset):
            inputs, targets = batch
            if i == 0:
                print(
                    f"First:\n\nInputs:\n",
                    inputs[0][-1],
                    "...",
                    inputs[-1][-1],
                    f"\n\nTargets:\n",
                    targets[0],
                    "...",
                    targets[-1],
                )
                print(inputs.shape, targets.shape, "\n\n")
            if i == last_i:
                print(
                    f"Last:\n\nInputs:\n",
                    inputs[0][-1],
                    "...",
                    inputs[-1][-1],
                    f"\n\nTargets:\n",
                    targets[0],
                    "...",
                    targets[-1],
                )
                print(inputs.shape, targets.shape, "\n\n")
        print("\n-------------------\n")

    results(train, train_index)
    results(validation, validation_index)
    results(test, test_index)

    return (
        batch_size,
        train,
        validation,
        test,
        train_index,
        validation_index,
        test_index,
    )
# inputs and targets are expected to be aligned (i.e., loss functions should subtract the predicted target#t from the actual target#t)
x = np.arange(101)
df = pd.DataFrame(index=x)
df['inputs'] = x
df['targets'] = x
batch_size, train, validation, test, train_index, validation_index, test_index = train_validate_test_split(df['inputs'], df['targets'], train_size_ratio=0.5, max_batch_size=2, memory=8)
All loss/metric functions that rely on y_pred and y_true assume matching indices; there's nothing special that Keras does in the background.
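A quick way to convince yourself of this (a sketch, assuming TF 2.x eager mode):

import tensorflow as tf

# Keras losses compare y_true[t] with y_pred[t] positionally; nothing is
# shifted internally.
y_true = tf.constant([[1.0], [2.0], [3.0]])
y_pred = tf.constant([[1.0], [2.0], [4.0]])
print(tf.keras.losses.MSE(y_true, y_pred).numpy())  # [0. 0. 1.]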

Forecasting stocks with LSTM in Keras (Python 3.7, Tensorflow 2.1.0)

I'm trying to use LSTM to predict how the Dow Jones Industrial Average will perform in coming months. I think it is appropriate to frame this as a time series scenario since the DJIA behaves like a stock, with my data values spread evenly in time. I'm only a beginner, so starting simply with only one feature (daily close value). Now I know that stocks are very random and it's hard to predict them well. And, the close value alone is not very informative... but I'll add other features later.
Dataset: DJIA historical data, Jan 28, 1985 - Jun 24, 2020, downloadable here: https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI.
Visualization with matplotlib:
I use a series of close values (number = 'sequence_length') to predict the close value that immediately follows the series (sequence_length + 1). For example, I use days 0-29 to predict day 30, days 1-30 to predict day 31, etc. Put another way, I partition the data such that x_train[0] contains close values for days 0-29, and y_train[0] contains the single value for day 30. Ok. So this is the result I get after running the model on my test data:
Ostensibly great, but I'm wondering if this whole concept is flawed: is the model merely seeing the data repetitively, and not learning any underlying pattern? See below for DJIA close predictions for 7/2020 through 4/2021. It seems to me that the prediction curve mimics the exact shape of the testing data, falling below 20,000 points and all...
Questions
Is this model valid? Is it a matter of changing parameters or reformatting data?
How the heck do you evaluate a model like this? Apparently 'accuracy' is an invalid metric. See below for the loss curve.
It was suggested that instead of using scalar close values for labels, I use sequences instead. For example, x_train[0] might include close values for days 0-29, and y_train[0] would include close values for days 30-60. I have been trying in vain to make this work and apparently have no idea how. I tried to make y_test and y_train Numpy arrays including arrays of sequence data - like this:
y_train, y_test = [], []
for i in range(sequence_length, len(training_set_scaled)):
    y_train.append(training_set_scaled[i + sequence_length: sequence_length*2 + i, 0])
    y_test.append(testing_set_scaled[i + sequence_length: sequence_length*2 + i, 0])
y_train = np.array(list(y_item for y_item in y_train))
y_test = np.array(list(y_item for y_item in y_test))
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).
Any help would be SO greatly appreciated, and perhaps we can all benefit ($). Joking... sort of.
The Code
df = pd.read_csv('DJIA_historical_data.csv') # 2D. Shape: (8924 examples, 7 features)
close_data = df['Close'] # 1D (examples, )
dates = df['Date'] # 1D (examples, )
adj_dates = mdates.datestr2num(dates) # Convert Pandas series to np array so matplotlib can plot
# Important parameter
sequence_length: int = 90 # Aka 'timesteps', or number of close values used to make each new prediction
# Split off the training set and scale it.
percent_training: float = 0.80
num_training_samples = ceil(percent_training*len(df)) # A whole number
training_set = df.iloc[:num_training_samples, 5:6].values # 2D, shape: (samples, 1 feature)
scaler = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = scaler.fit_transform(training_set) #Shape is 2D: (num_training_samples, 1)
# Build 3D training set. Final shape: (examples, sequence_length, 1)
x_train = np.array([training_set_scaled[i - sequence_length:i, 0] for i in range(sequence_length, len(training_set_scaled))])
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
# Build test sets
num_testing_samples: int = len(df) - x_train.shape[0] # Scalar value
testing_set = df.iloc[-num_testing_samples:, 5:6].values # 2D (examples, 1)
testing_set_scaled = scaler.fit_transform(testing_set) # 2D ndarray (examples, 1)
x_test = np.array([testing_set_scaled[i - sequence_length:i, 0] for i in range(sequence_length, len(testing_set_scaled))])
x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], 1)) #3D shape: (examples-sequence_length, sequence_length, 1).
# Build 1D training labels (examples, )
y_train = np.array([training_set_scaled[i, 0] for i in range(sequence_length, len(training_set_scaled))])
y_test = np.array([testing_set_scaled[i, 0] for i in range(sequence_length, len(testing_set_scaled))]) # (examples-sequence_length, 1)
y_test = np.reshape(y_test, (y_test.shape[0])) #1D (examples, )
# Build Model
epochs: int = 150
batch_size: int = 32
LSTM_1 = LSTM(
    units = 5, # I reduced model complexity because I thought it would reduce overfitting. No such luck
    input_shape = (x_train.shape[1], 1),
    return_sequences = False,
)
LSTM_2 = LSTM(
    units = 10
)
model = Sequential()
model.add(LSTM_1) # Output shape: (batch_size, sequence_length, units)
model.add(Dropout(0.4))
# model.add(LSTM_2) # Output shape: ?
# model.add(Dropout(0.2))
model.add(Dense(1)) # Is linear activation appropriate here?
model.compile(loss = 'mean_squared_error',
              optimizer = 'adam',
              )
early_stopping = EarlyStopping(monitor='val_loss',
                               mode='min',
                               verbose = 1,
                               patience = 9,
                               restore_best_weights = False
                               )
history = model.fit(x_train,
                    y_train,
                    epochs = epochs,
                    batch_size = batch_size,
                    verbose = 2,
                    validation_split = 0.20,
                    # validation_data = (x_test, y_test),
                    callbacks = [early_stopping],
                    )
# Evaluate performance
model.summary()
loss = model.evaluate(x_test, y_test, batch_size = batch_size)
# early_stopping.stopped_epoch returns 0 if training didn't stop early.
print('Training stopped after',early_stopping.stopped_epoch,'epochs.')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss vs. Epoch')
plt.ylabel('loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
prediction = model.predict(x_test)
prediction = scaler.inverse_transform(prediction)
y_test2 = np.reshape(y_test, (y_test.shape[0], 1))
y_test = scaler.inverse_transform(y_test2)
test_dates = adj_dates[-x_test.shape[0]:]
# Visualizing the results
plt.plot_date(test_dates, y_test, '-', linewidth = 2, color = 'red', label = 'Real DJIA Close')
plt.plot(test_dates, prediction, color = 'blue', label = 'Predicted Close')
plt.title('Close Prediction')
plt.xlabel('Time')
plt.ylabel('DJIA Close')
plt.legend()
plt.show()
# Generate future data
time_horizon = sequence_length
# future_lookback = adj_dates[-time_horizon:]
last_n = x_test[-time_horizon:,:,:] # Find last n number of days
future_prediction = model.predict(last_n)
future_prediction2 = np.reshape(future_prediction, (future_prediction.shape[0], 1))
future_prediction3 = scaler.inverse_transform(future_prediction2)
future_prediction3 = np.reshape(future_prediction3, (future_prediction3.shape[0]))
full_dataset_numpy = np.array(close_data)
all_data = np.append(full_dataset_numpy, future_prediction3)
plt.plot(all_data, color = 'blue', label = 'All data')
plt.title('All data including predictions')
plt.xlabel('Time')
plt.ylabel('DJIA Close')
plt.legend()
plt.show()
# Generate dates for future predictions
# Begin at the last date in the dataset, then add 'time_horizon' many new dates
last_date = dates.iloc[-1] # String
timestamp_list = pd.date_range(last_date, periods = time_horizon).tolist() #List of timestamps
# Convert list of timestamps to list of strings
datestring_list = [i.strftime("%Y-%m-%d") for i in timestamp_list] #List of strings
# Clip first value, which is already included in the dataset
datestring2 = mdates.datestr2num(datestring_list)
plt.plot_date(datestring2, future_prediction3, '-', color = 'blue', label = 'Predicted Close')
plt.title('DJIA Close Prediction')
plt.xlabel('Date')
plt.ylabel('Predicted Close')
plt.xticks(rotation = 45)
plt.legend()
plt.show()
Case 1: At the start of your question, you mentioned "For example, I use days 0-29 to predict day 30, days 1-30 to predict day 31, etc. ".
Case 2: But in Question 3, you mentioned "For example, x_train[0] might include close values for days 0-29, and y_train[0] would include close values for days 30-60.".
Do you want to predict the close value of the next day, or the close values of the next 30 days?
For generating the Data for X and Y (Train and Test), you can use the function mentioned below:
def univariate_data(dataset, start_index, end_index, history_size, target_size):
    data = []
    labels = []
    start_index = start_index + history_size
    if end_index is None:
        end_index = len(dataset) - target_size
    for i in range(start_index, end_index):
        indices = range(i-history_size, i)
        # Reshape data from (history_size,) to (history_size, 1)
        data.append(np.reshape(dataset[indices], (history_size, 1)))
        labels.append(dataset[i+target_size])
    return np.array(data), np.array(labels)
The value of the argument history_size will be 30, and the value of target_size will be 1 for Case 1 and 30 for Case 2 (mentioned above).
You need to call that function once for Training and once for Testing as shown below:
univariate_past_history = 30
univariate_future_target = 1  # use 1 for Case 1, 30 for Case 2
x_train_uni, y_train_uni = univariate_data(data, 0, TRAIN_SPLIT,
                                           univariate_past_history,
                                           univariate_future_target)
x_val_uni, y_val_uni = univariate_data(data, TRAIN_SPLIT, None,
                                       univariate_past_history,
                                       univariate_future_target)
Please find this Tensorflow Tutorial which explains both Univariate (One Column) and Multi Variate (multiple columns) Time Series Analysis along with step by step Code, comprehensively.
Answering your questions in the sequence which you have asked:
Yes. Referring the Tutorial will help.
Yes, Accuracy is an invalid metric. You can use MAE or MSE, as shown below:
simple_lstm_model.compile(optimizer='adam', loss='mae')
We should use Numpy Arrays instead of Sequences.
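A plausible explanation of the ValueError you saw, for what it's worth: in your label-building loop the slice training_set_scaled[i + sequence_length : sequence_length*2 + i, 0] runs past the end of the array for the last iterations, so those label windows come out shorter than sequence_length and NumPy builds a ragged object array, which Keras cannot convert to a tensor. A minimal sketch that keeps every window fully inside the array:

import numpy as np

# Build fixed-length label windows only where they fit entirely inside the
# scaled training array, so np.array yields a regular 2D float array.
y_train = np.array([
    training_set_scaled[i:i + sequence_length, 0]
    for i in range(sequence_length, len(training_set_scaled) - sequence_length + 1)
])  # shape: (examples, sequence_length)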
Please let me know if you face any other issue and we will be Happy to help you.

Keras: Predicting New Values using predict_generator

At the moment I'm trying to follow a example of Temperature Forecasting in Keras (as given in chapter 6.3 of F. Chollet's "Deep Learning with Python" book). I'm having some issues with prediction using the generator that is specified. My understanding is that I should be using model.predict_generator for prediction, but I'm unsure how to use the steps parameter for this method and how to get back predictions that are the correct "shape" for my original data.
Ideally, I would like to be able to plot the test set (indices 300001 until the end) and also plot my predictions for this test set (i.e. an array of the same length with predicted values).
An example (Dataset available here: https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip) is as follows:
import numpy as np
# Read in data
fname = ('jena_climate_2009_2016.csv')
f = open(fname)
data = f.read()
f.close()
lines = data.split('\n')
col_names = lines[0].split(',')
col_names = [i.replace('"', "") for i in col_names]
# Parse the remaining lines into a float array
lines = [line for line in lines[1:] if line]
float_data = np.zeros((len(lines), len(col_names) - 1))
for i, line in enumerate(lines):
    values = [float(x) for x in line.split(',')[1:]]
    float_data[i, :] = values
# Normalize the data
temp = float_data[:, 1]
mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std
def generator(data, lookback, delay, min_index, max_index, shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while 1:
        if shuffle:
            rows = np.random.randint(
                min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)
        samples = np.zeros((len(rows),
                            lookback // step,
                            data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield(samples, targets)
lookback = 720
step = 6
delay = 144
batch_size = 128  # assumed value: batch_size is used below but was never defined in the snippet
train_gen = generator(float_data, lookback=lookback, delay=delay,
                      min_index=0, max_index=200000, shuffle=True,
                      step=step, batch_size=batch_size)
val_gen = generator(float_data, lookback=lookback, delay=delay,
                    min_index=200001, max_index=300000, step=step,
                    batch_size=batch_size)
test_gen = generator(float_data, lookback=lookback, delay=delay,
                     min_index=300001, max_index=None, step=step,
                     batch_size=batch_size)
val_steps = (300000 - 200001 - lookback)
test_steps = (len(float_data) - 300001 - lookback)
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
model = Sequential()
model.add(layers.Flatten(input_shape=(lookback // step, float_data.shape[-1])))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(), loss='mae')
model.fit_generator(train_gen, steps_per_epoch=500,
                    epochs=20, validation_data=val_gen,
                    validation_steps=val_steps)
After some searching around online, I tried some techniques similar to the following:
pred = model.predict_generator(test_gen, steps=test_steps // batch_size)
However the prediction array that I got back was far too long and didn't match up to my original data at all. Has anyone got any suggestions?
For anyone looking at the question now: we are not required to specify the steps parameter when using predict_generator in newer versions of Keras. Ref: https://github.com/keras-team/keras/issues/11902
If a value is provided, predictions for steps*batch_size examples will be generated. This may result in the exclusion of len(test) % batch_size rows, as mentioned by the OP.
Also, it seems to me that setting batch_size=1 defeats the purpose of using the generator, as it is equivalent to iterating over the test data one by one.
Similarly setting steps=1 (when batch_size is not set in test_generator) will read the entire test data at once, which is not ideal for large test data.
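One caveat worth adding (my understanding of how predict_generator handles plain generators): omitting steps only works when Keras can determine the input's length, e.g. with a keras.utils.Sequence. The while 1 generator in the question has no length, so a finite steps value is still required there:

# Keras cannot infer the length of a plain `while 1` generator, so pass steps
# explicitly; with a keras.utils.Sequence the length is known and steps can be
# omitted.
test_steps = (len(float_data) - 300001 - lookback) // batch_size
pred = model.predict_generator(test_gen, steps=test_steps)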
For the steps argument of predict_generator, divide the number of images in your test path by the batch size you provided in test_gen.
For example: if I have 50 images and provided a batch size of 10, then steps would be 5.
# First separate the test images and test labels
test_images, test_labels = next(test_gen)
# Get the class indices
test_labels = test_labels[:, 0]  # this should give you an array of labels
predictions = model.predict_generator(test_gen, steps=num_images // batch_size, verbose=0)  # num_images: count of test images
predictions[:, 0]  # these are your actual predictions
Your original code looks correct:
pred = model.predict_generator(test_gen, steps=test_steps // batch_size)
I tried and did not see any problem generating a pred of length around 120k. What size did you get?
Actually both of the steps in the code are incorrect. They should be:
val_steps = (300000 - 200001 - lookback) // batch_size
test_steps = (len(float_data) - 300001 - lookback) // batch_size
(Didn't it take forever for your validation to run for each epoch?)
Of course with this correction you can simply use
pred = model.predict_generator(test_gen, steps=test_steps)
As I arrived at a semi-acceptable version of an answer to my own question, I decided to post it for posterity:
test_gen = generator(float_data, lookback=lookback, delay=delay,
                     min_index=300001, max_index=None, step=step,
                     batch_size=1) # "reset" the generator
pred = model.predict_generator(test_gen, steps=test_steps)
This now has the shape I want to plot it against my original test set. I could also use a more manual approach inspired somewhat by this answer:
test_gen = generator(float_data, lookback=lookback, delay=delay,
                     min_index=300001, max_index=None, step=step,
                     batch_size=1) # "reset" the generator
truth = []
pred = []
for i in range(test_steps):
    x, y = next(test_gen)
    pred.append(model.predict(x))
    truth.append(y)
pred = np.concatenate(pred)
truth = np.concatenate(truth)

Can't interpret prediction with neural network use

I'm trying to use TensorFlow in Python to make some predictions with cryptocurrency data. The problem is that the prediction output is a number in the 0.1-0.9 range, whereas the cryptocurrency data is in a 10000-10100 range, and I can't find a way to convert the 0.* number back to the real one.
I've tried to create a ratio by taking max − min of the predicted values and max − min of the test data and dividing them, but when I multiply the prediction by this ratio there is a large error (I get a value around 14000 instead of 10000).
Here some code :
train_start = 0
train_end = int(np.floor(0.7*n))
test_start = train_end
test_end = n
data_train = data[np.arange(train_start, train_end), :]
data_test = data[np.arange(test_start, test_end), :]
Scale data:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_train = scaler.fit_transform(data_train)
data_test = scaler.transform(data_test)
Build X and y:
X_train = data_train[:, 1:]
y_train = data_train[:, 0]
X_test = data_test[:, 1:]
y_test = data_test[:, 0]
...
n_data = 10
n_neurons_1 = 1024
n_neurons_2 = 512
n_neurons_3 = 256
n_neurons_4 = 128
n_target = 1
X = tf.compat.v1.placeholder(dtype=tf.compat.v1.float32, shape=[None, n_data])
Y = tf.compat.v1.placeholder(dtype=tf.compat.v1.float32, shape=[None])
Hidden layer
..
Output layer (must be transposed)
..
Cost function
..
Optimizer
..
Make Session:
sess = tf.compat.v1.Session()
Run initializer:
sess.run(tf.compat.v1.global_variables_initializer())
Setup interactive plot:
plt.ion()
fig = plt.figure()
ax1 = fig.add_subplot(111)
line1, = ax1.plot(y_test)
line2, = ax1.plot(y_test*0.5)
plt.show()
epochs = 10
batch_size = 256
for e in range(epochs):
    # Shuffle training data
    shuffle_indices = np.random.permutation(np.arange(len(y_train)))
    X_train = X_train[shuffle_indices]
    y_train = y_train[shuffle_indices]
    # Minibatch training
    for i in range(0, len(y_train) // batch_size):
        start = i * batch_size
        batch_x = X_train[start:start + batch_size]
        batch_y = y_train[start:start + batch_size]
        # Run optimizer with batch
        sess.run(opt, feed_dict={X: batch_x, Y: batch_y})
        # Show progress
        if np.mod(i, 5) == 0:
            # Prediction
            pred = sess.run(out, feed_dict={X: X_test})
#This pred var is the output of the prediction
This pred variable is the output of the prediction. I persist my results to a file, and this is what it looks like:
2019-08-21 06-AM;15310.444858356934;0.50021994;
2019-08-21 12-PM;14287.717187390663;0.46680558;
2019-08-21 06-PM;14104.63871795706;0.46082407;
For example, the last prediction is 0.46, but when I try to convert it I get 14104, whereas it should be nearer to a 10000 value.
Does anyone have an idea how to convert those predictions?
Thanks!
You will have to make use of MinMaxScaler's inverse_transform to convert the output you are getting in the 0-1 range back to the original scale.
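A minimal sketch of that idea (raw_train is a hypothetical placeholder for your unscaled training matrix, with the target in column 0 as in your question; fitting a separate scaler on just the target column makes the inversion straightforward):

from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Fit one scaler on the target column only, so a 1-column prediction can be
# inverted directly back to the original price range.
y_scaler = MinMaxScaler()
y_train_scaled = y_scaler.fit_transform(raw_train[:, [0]])
# ... train the network against y_train_scaled ...
pred_scaled = np.array([[0.46]])                      # example network output
pred_price = y_scaler.inverse_transform(pred_scaled)  # back on the ~10000 scale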
You have not shown your model, but I believe you are doing a regression task with a few dense layers. You will have to keep minimizing your loss. If you are using mean squared error, the larger the loss, the more likely your output is to be far away from the desired results.
Even if your loss is a small number and the results are good for the training samples, if the prediction is bad for the test dataset, you may have to consider increasing your training dataset so that more possibilities are covered. If that is not possible, consider reducing the number of neurons in your neural network so that it stops over-fitting.
You can do some postprocessing to restrict the output to some desired range.
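For instance, a simple clamp (the bounds here are hypothetical placeholders):

import numpy as np

# Clamp predictions to a plausible price band.
pred_price = np.clip(pred_price, 10000, 10100)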
