At the moment I'm trying to follow an example of temperature forecasting in Keras (as given in chapter 6.3 of F. Chollet's "Deep Learning with Python" book). I'm having some issues with prediction using the generator that is specified. My understanding is that I should be using model.predict_generator for prediction, but I'm unsure how to use the steps parameter for this method and how to get back predictions that are the correct "shape" for my original data.
Ideally, I would like to be able to plot the test set (indices 300001 until the end) and also plot my predictions for this test set (i.e. an array of the same length with predicted values).
An example (Dataset available here: https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip) is as follows:
import numpy as np
# Read in data
fname = ('jena_climate_2009_2016.csv')
f = open(fname)
data = f.read()
f.close()
lines = data.split('\n')
col_names = lines[0].split(',')
col_names = [i.replace('"', "") for i in col_names]
# Parse the lines into a float array, dropping the date column
lines = [line for line in lines[1:] if line]
float_data = np.zeros((len(lines), len(col_names) - 1))
for i, line in enumerate(lines):
    float_data[i, :] = [float(x) for x in line.split(',')[1:]]
temp = float_data[:, 1]  # the temperature column
# Normalize using the mean/std of the first 200,000 (training) rows
mean = float_data[:200000].mean(axis=0)
float_data -= mean
std = float_data[:200000].std(axis=0)
float_data /= std
def generator(data, lookback, delay, min_index, max_index,
              shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while True:
        if shuffle:
            rows = np.random.randint(
                min_index + lookback, max_index, size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)
        samples = np.zeros((len(rows),
                            lookback // step,
                            data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield samples, targets
lookback = 720
step = 6
delay = 144
batch_size = 128
train_gen = generator(float_data, lookback=lookback, delay=delay,
min_index=0, max_index=200000, shuffle=True,
step=step, batch_size=batch_size)
val_gen = generator(float_data, lookback=lookback, delay=delay,
min_index=200001, max_index=300000, step=step,
batch_size=batch_size)
test_gen = generator(float_data, lookback=lookback, delay=delay,
min_index=300001, max_index=None, step=step,
batch_size=batch_size)
val_steps = (300000 - 200001 - lookback)
test_steps = (len(float_data) - 300001 - lookback)
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
model = Sequential()
model.add(layers.Flatten(input_shape=(lookback // step, float_data.shape[-1])))
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(1))
model.compile(optimizer=RMSprop(), loss='mae')
model.fit_generator(train_gen, steps_per_epoch=500,
epochs=20, validation_data=val_gen,
validation_steps=val_steps)
After some searching around online, I tried some techniques similar to the following:
pred = model.predict_generator(test_gen, steps=test_steps // batch_size)
However the prediction array that I got back was far too long and didn't match up to my original data at all. Has anyone got any suggestions?
For anyone looking at the question now: with newer versions of Keras, we are not required to specify the steps parameter when using predict_generator. Ref: https://github.com/keras-team/keras/issues/11902
If a value is provided, predictions for steps * batch_size examples will be generated. This may result in the exclusion of len(test) % batch_size rows, as mentioned by the OP.
Also, it seems to me that setting batch_size=1 defeats the purpose of using the generator, as it is equivalent to iterating over the test data one by one.
Similarly, setting steps=1 (when batch_size is not set in the test generator) will read the entire test data at once, which is not ideal for large test data.
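If you do pass steps, a common pattern (a sketch only; num_test_samples is a placeholder for however many test rows you have, and it assumes a finite generator whose last batch may be partial) is:
import numpy as np

# Round up so the final partial batch is not dropped
steps = int(np.ceil(num_test_samples / batch_size))
pred = model.predict_generator(test_gen, steps=steps)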
For the steps argument of predict_generator, divide the number of images you have in the test path by the batch size you provided to test_gen.
For example: if I have 50 images and provided a batch size of 10, then steps would be 5.
# First, separate the test images and test labels
test_images, test_labels = next(test_gen)
# Get the class indices
test_labels = test_labels[:, 0]  # this should give you an array of labels
predictions = model.predict_generator(test_gen, steps=num_images // batch_size, verbose=0)  # num_images = total images in the test set
predictions[:, 0]  # these are your actual predictions
Your original code looks correct:
pred = model.predict_generator(test_gen, steps=test_steps // batch_size)
I tried and did not see any problem generating a pred of length around 120k. What size did you get?
Actually, both of the step counts in the code are incorrect. They should be:
val_steps = (300000 - 200001 - lookback) // batch_size
test_steps = (len(float_data) - 300001 - lookback) // batch_size
(Didn't it take forever for your validation to run for each epoch?)
Of course with this correction you can simply use
pred = model.predict_generator(test_gen, steps=test_steps)
As I arrived at a semi-acceptable version of an answer to my own question, I decided to post it for posterity:
test_gen = generator(float_data, lookback=lookback, delay=delay,
min_index=300001, max_index=None, step=step,
batch_size=1) # "reset" the generator
pred = model.predict_generator(test_gen, steps=test_steps)
This now has the right shape to plot against my original test set. I could also use a more manual approach, inspired somewhat by this answer:
test_gen = generator(float_data, lookback=lookback, delay=delay,
min_index=300001, max_index=None, step=step,
batch_size=1) # "reset" the generator
truth = []
pred = []
for i in range(test_steps):
    x, y = next(test_gen)
    pred.append(model.predict(x))
    truth.append(y)
pred = np.concatenate(pred)
truth = np.concatenate(truth)
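With truth and pred aligned one-to-one, the plot I originally asked about is straightforward; a sketch (assuming matplotlib), de-normalizing the temperature column (index 1) with the mean/std computed earlier:
import matplotlib.pyplot as plt

# Undo the normalization for the temperature column
truth_deg = truth * std[1] + mean[1]
pred_deg = pred.ravel() * std[1] + mean[1]

plt.plot(truth_deg, label='actual temperature')
plt.plot(pred_deg, label='predicted temperature')
plt.legend()
plt.show()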
Related
I created this method to augment my time series dataset:
import numpy as np
import tsaug

my_augmenter = (
    tsaug.AddNoise(scale=(0.001, 0.5))  # 0.5
    + tsaug.Convolve(window="flattop", size=5)
    + tsaug.Drift(max_drift=(0, 0.1))
    + tsaug.TimeWarp(40)
)

def augment(x_train):
    x_aug = []
    for i in range(x_train.shape[0]):
        temp = my_augmenter.augment(x_train[i])
        x_aug.append(temp)
    x_aug = np.array(x_aug)
    return x_aug
And I want to use the classic Keras fit method to train my neural network, using the augment() method to generate a new augmented training dataset at each epoch:
history = model.fit(
    x=augment(xtr),
    y=ytr,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(xval, yval)
).history
But I have the impression that the augmentation runs only for the first epoch, and then the same augmented xtr is used for the whole fit. How can I solve this?
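One possibility (a sketch, not a confirmed fix): model.fit evaluates x=augment(xtr) once, so the augmentation indeed runs only once. Wrapping the data in a tf.keras.utils.Sequence, whose on_epoch_end hook Keras calls at the end of every epoch, gives fresh augmentation per epoch:
import numpy as np
import tensorflow as tf

class AugmentedSequence(tf.keras.utils.Sequence):
    def __init__(self, x, y, batch_size):
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.x_aug = augment(self.x)  # augmented copy for the first epoch

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x_aug[sl], self.y[sl]

    def on_epoch_end(self):
        self.x_aug = augment(self.x)  # re-augment from the originals each epoch

history = model.fit(
    AugmentedSequence(xtr, ytr, batch_size),
    epochs=epochs,
    validation_data=(xval, yval),
).history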
I have multiple input features and a singular target feature that correspond 1:1 to each other's index; meaning there should be no forward-looking or backward-looking when it comes to comparing inputs to targets: input[t] <=> target[t]. Essentially, I have already time-shifted my targets backwards to their corresponding input indexes for training purposes.
Under normal operating procedures, I would use N periods worth of past data in order to predict 1 future value, N periods ahead. As the frame shifts forward in time, each respective slot is filled with the [t+N] forecast, recorded at [t].
Now, based on whatever environment I'm developing in, I will need to use either timeseries_dataset_from_array or TimeseriesGenerator to batch my data (based on system support). I need to know if the implementation I made produces batches that will do what I expect when running model.fit() in keras. I'm unsure of whether or not keras is internally shifting data during fitting that I'm unaware of that might lead to poor results.
I'm using an LSTM, potentially with the stateful argument, so I need to ensure my batches are a perfect fit, and I also wanted to ensure the batch sizes are a power of 2 (according to some posts regarding processor efficiency). I've tried implementing my own function to make this happen, given a few additional assumptions regarding validation/test sizes. On the surface everything looks good, but since I'm unsure of Keras' internals I don't know if I've made a blunder.
My question is whether or not I've properly aligned/batched the inputs and targets using timeseries_dataset_from_array/TimeseriesGenerator such that running model.fit() will train using losses/metrics that compare the target at time [t] with the predicted value at time [t] using inputs at time [t].
import pandas as pd
import numpy as np

use_ts_data = True
try:
    # Comment this line out if you want to test timeseries_dataset_from_array
    raise ImportError("No TDFA for you")
    from tensorflow.keras.preprocessing import timeseries_dataset_from_array as ts_data
except (ModuleNotFoundError, ImportError):
    from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator as ts_gen
    use_ts_data = False

def gp2(size):
    # Greatest power of 2 no greater than size
    return np.power(2, int(np.log2(size)))

def train_validate_test_split(
    features, targets, train_size_ratio=0.5, max_batch_size=None, memory=1,
):
    def batch_size_with_buffer(buffer, available, desired, max_batch_size):
        batch_size = gp2(min(desired, max_batch_size or np.inf))
        if available < batch_size * 3 + buffer:
            # If we don't have enough records to support this batch_size, use 1 power lower
            batch_size = np.power(2, np.log(batch_size) / np.log(2) - 1)
        return int(batch_size)

    memory = max(1, memory)
    surplus = memory - 1
    test_size_ratio = 1 - train_size_ratio
    total_size = features.shape[0]
    smallest_size = int(total_size * test_size_ratio / 2)

    # Error on insufficient data
    def insufficient_data():
        raise RuntimeError(
            f"Insufficient data on which to split train/validation/test when ratio={train_size_ratio}%, nobs={total_size} and memory={memory}"
        )

    if total_size < memory + 3:
        insufficient_data()

    # Find the greatest batch size that is a power of 2, fits the smallest dataset size,
    # and is no greater than max_batch_size
    batch_size = batch_size_with_buffer(
        surplus, total_size, smallest_size, max_batch_size
    )
    test_size = smallest_size - smallest_size % batch_size

    # Create/align the datasets
    if use_ts_data:
        index_offset = None

        start = -test_size
        X_test = features.iloc[start - surplus:]
        y_test = targets.iloc[start:]

        end = start
        start = end - test_size
        X_validation = features.iloc[start - surplus:end]
        y_validation = targets.iloc[start:end]

        end = start
        start = (total_size + end - surplus) % batch_size
        X_train = features.iloc[start:end]
        y_train = targets.iloc[start + surplus:end]
    else:
        index_offset = memory
        _features = features.shift(-1)

        start = -test_size - memory
        X_test = _features.iloc[start:]
        y_test = targets.iloc[start:]

        end = start + memory
        start = end - test_size - memory
        X_validation = _features.iloc[start:end]
        y_validation = targets.iloc[start:end]

        end = start + memory
        start = (total_size + end - memory) % batch_size
        X_train = _features.iloc[start:end]
        y_train = targets.iloc[start:end]

    # Record indexes
    test_index = y_test.index[index_offset:]
    validation_index = y_validation.index[index_offset:]
    train_index = y_train.index[index_offset:]

    if memory > X_train.shape[0] or memory > X_validation.shape[0]:
        insufficient_data()

    format_data = ts_data if use_ts_data else ts_gen
    train = format_data(X_train.values, y_train.values, memory, batch_size=batch_size)
    validation = format_data(
        X_validation.values, y_validation.values, memory, batch_size=batch_size
    )
    test = format_data(X_test.values, y_test.values, memory, batch_size=batch_size)

    # Print out the batched data for inspection
    def results(dataset, index):
        print("\n-------------------\n")
        print("Index:\n\n", index, "\n\n")
        last_i = len(dataset) - 1
        for i, batch in enumerate(dataset):
            inputs, targets = batch
            if i == 0:
                print(
                    "First:\n\nInputs:\n",
                    inputs[0][-1],
                    "...",
                    inputs[-1][-1],
                    "\n\nTargets:\n",
                    targets[0],
                    "...",
                    targets[-1],
                )
                print(inputs.shape, targets.shape, "\n\n")
            if i == last_i:
                print(
                    "Last:\n\nInputs:\n",
                    inputs[0][-1],
                    "...",
                    inputs[-1][-1],
                    "\n\nTargets:\n",
                    targets[0],
                    "...",
                    targets[-1],
                )
                print(inputs.shape, targets.shape, "\n\n")
        print("\n-------------------\n")

    results(train, train_index)
    results(validation, validation_index)
    results(test, test_index)

    return (
        batch_size,
        train,
        validation,
        test,
        train_index,
        validation_index,
        test_index,
    )

# inputs and targets are expected to be aligned (i.e., loss functions should
# subtract the predicted target at t from the actual target at t)
x = np.arange(101)
df = pd.DataFrame(index=x)
df['inputs'] = x
df['targets'] = x

batch_size, train, validation, test, train_index, validation_index, test_index = \
    train_validate_test_split(df['inputs'], df['targets'],
                              train_size_ratio=0.5, max_batch_size=2, memory=8)
All loss/metric functions rely on y_pred and y_true having matching indices. There's nothing special that Keras does in the background.
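A quick way to convince yourself of the alignment TimeseriesGenerator produces (a toy check, not from the question): with length=memory, the window data[i:i+memory] is paired with targets[i+memory].
import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

data = np.arange(10).reshape(-1, 1)
targets = np.arange(10)
gen = TimeseriesGenerator(data, targets, length=3, batch_size=2)
x, y = gen[0]
print(x[0].ravel(), y[0])  # [0 1 2] 3 -> the window ending at t-1 predicts targets[t]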
I have a big dataset; its initial dimension is (453732, 839). In this dataset I have subgroups that are related to each other, and those subgroups have variable dimensions. Since I have to train an LSTM, each subgroup must be the same size, so I apply padding to each subgroup so that they are all the same length.
After padding, the dataset grows to about 2,000,000 rows.
So I'm executing the model.fit() function within a loop, where model.fit() is executed once for each part of the dataset. Inside the loop I pad each part of the dataset on the fly before passing it to model.fit(), but at the second part, before the model.fit() call, RAM fills up and I can't continue training.
This is the code in which I pad and fit the model:
training_set_portion_size = int(training_dataset.shape[0] / 6)
start_portion_index = 0
for epoch in range(0, 50):
    for part in range(0, 4):
        end_portion_index = start_portion_index + training_set_portion_size
        training_set_portion = training_dataset[start_portion_index:end_portion_index]
        training_set_portion_labels = training_set_portion[:, training_set_portion.shape[1] - 1]
        portion_groups = get_groups_id_count(training_set_portion[:, 0])

        # Scale dataset portion
        training_set_portion = scaler.transform(training_set_portion[:, 0:training_set_portion.shape[1] - 1])
        training_set_portion = np.concatenate((training_set_portion, training_set_portion_labels[:, np.newaxis]), axis=1)

        # Pad dataset portion
        training_set_portion = pad_groups(training_set_portion, portion_groups)
        training_set_portion_labels = training_set_portion[:, training_set_portion.shape[1] - 1]

        # Excluding group and label from training_set_portion
        training_set_portion = training_set_portion[:, 1:training_set_portion.shape[1] - 1]

        # Reshape data for LSTM
        training_set_portion = training_set_portion.reshape(int(training_set_portion.shape[0] / timesteps), timesteps, features)
        training_set_portion_labels = training_set_portion_labels.reshape(int(training_set_portion_labels.shape[0] / timesteps), timesteps)

        model.fit(training_set_portion, training_set_portion_labels, validation_split=0.2, shuffle=False, epochs=1,
                  batch_size=1, workers=0, max_queue_size=1, verbose=1)
UPDATE
I'm using pandas now, with chunksize, but it seems the tensors are being concatenated across loop iterations.
pandas iterator:
training_dataset_iterator = pd.read_csv('/content/drive/My Drive/Tesi_magistrale/en-train.H',
chunksize=80000, sep=",", header=None, dtype=np.float64)
New code:
for epoch in range(0, 50):
    for chunk in training_dataset_iterator:
        training_set_portion = chunk.values
        training_set_portion_labels = training_set_portion[:, training_set_portion.shape[1] - 1]
        portion_groups = get_groups_id_count(training_set_portion[:, 0])

        # Scale dataset portion
        training_set_portion = scaler.transform(training_set_portion[:, 0:training_set_portion.shape[1] - 1])
        training_set_portion = np.concatenate((training_set_portion, training_set_portion_labels[:, np.newaxis]), axis=1)

        # Pad dataset portion
        print('Padding portion...\n')
        training_set_portion = pad_groups(training_set_portion, portion_groups)
        training_set_portion_labels = training_set_portion[:, training_set_portion.shape[1] - 1]

        # Excluding group and label from training_set_portion
        training_set_portion = training_set_portion[:, 1:training_set_portion.shape[1] - 1]

        # Reshape data for LSTM
        training_set_portion = training_set_portion.reshape(int(training_set_portion.shape[0] / timesteps), timesteps, features)
        training_set_portion_labels = training_set_portion_labels.reshape(int(training_set_portion_labels.shape[0] / timesteps), timesteps)

        print('Training set portion shape: ', training_set_portion.shape)
        model.fit(training_set_portion, training_set_portion_labels, validation_split=0.2, shuffle=False, epochs=1,
                  batch_size=1, workers=0, max_queue_size=1, verbose=1)
The first print('Training set portion shape: ', training_set_portion.shape) gave me (21327, 20, 837), but the second gave me (43194, 20, 837). I don't understand why.
UPDATE 2
I noticed that training_set_portion = pad_groups(training_set_portion, portion_groups) somehow duplicates data.
Pad groups code:
def pad_groups(dataset, groups):
    max_subtree_length = 20
    start = 0
    rows, cols = dataset.shape
    padded_dataset = []
    index = 1
    for group in groups:
        pad = [group[0]] + [0] * (cols - 1)
        stop = start + group[1]
        subtree = dataset[start:stop].tolist()
        padded_dataset.extend(subtree)
        subtree_to_pad = max_subtree_length - group[1]
        pads = [pad] * subtree_to_pad
        padded_dataset.extend(pads)
        start = stop
        index += 1
    padded_dataset = np.array(padded_dataset)
    return padded_dataset
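For reference, a tiny usage sketch (groups appear to be (id, row_count) pairs, judging from the indexing above):
import numpy as np

ds = np.array([[7, 1.0, 2.0],
               [7, 3.0, 4.0]])  # one group with id 7 and 2 rows
out = pad_groups(ds, [(7, 2)])
print(out.shape)  # (20, 3): 2 real rows + 18 pad rows of [7, 0, 0]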
How can I fix this?
Thank you in advance.
I found an article on Towards Data Science that shows three simple methods to handle this problem using pandas, a library widely used for dataset processing. I hope it is of some help. Here is the link:
https://towardsdatascience.com/3-simple-ways-to-handle-large-data-with-pandas-d9164a3c02c1
I solved my issues: there was a bug in my code.
I'm trying to use TensorFlow in Python to make predictions with cryptocurrency data. The problem is that the output of the prediction is a number in the 0.1-0.9 range, whereas the cryptocurrency prices are in a 10000-10100 range, and I can't find a way to convert the 0.* number back to the real one.
I've tried to create a ratio by taking max - min of the predicted values and max - min of the test data and dividing one by the other, but when I multiply the predictions by this ratio the error is large (I got a 14000 number instead of a 10000 one).
Here is some code:
train_start = 0
train_end = int(np.floor(0.7*n))
test_start = train_end
test_end = n
data_train = data[np.arange(train_start, train_end), :]
data_test = data[np.arange(test_start, test_end), :]
Scale data:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_train = scaler.fit_transform(data_train)
data_test = scaler.transform(data_test)
Build X and y:
X_train = data_train[:, 1:]
y_train = data_train[:, 0]
X_test = data_test[:, 1:]
y_test = data_test[:, 0]
.
.
.
n_data = 10
n_neurons_1 = 1024
n_neurons_2 = 512
n_neurons_3 = 256
n_neurons_4 = 128
n_target = 1
X = tf.compat.v1.placeholder(dtype=tf.float32, shape=[None, n_data])
Y = tf.compat.v1.placeholder(dtype=tf.float32, shape=[None])
Hidden layer
..
Output layer (must be transposed)
..
Cost function
..
Optimizer
..
Make Session:
sess = tf.compat.v1.Session()
Run initializer:
sess.run(tf.compat.v1.global_variables_initializer())
Setup interactive plot:
plt.ion()
fig = plt.figure()
ax1 = fig.add_subplot(111)
line1, = ax1.plot(y_test)
line2, = ax1.plot(y_test*0.5)
plt.show()
epochs = 10
batch_size = 256
for e in range(epochs):
    # Shuffle training data
    shuffle_indices = np.random.permutation(np.arange(len(y_train)))
    X_train = X_train[shuffle_indices]
    y_train = y_train[shuffle_indices]
    # Minibatch training
    for i in range(0, len(y_train) // batch_size):
        start = i * batch_size
        batch_x = X_train[start:start + batch_size]
        batch_y = y_train[start:start + batch_size]
        # Run optimizer with batch
        sess.run(opt, feed_dict={X: batch_x, Y: batch_y})
        # Show progress
        if np.mod(i, 5) == 0:
            # Prediction: this pred var is the output of the prediction
            pred = sess.run(out, feed_dict={X: X_test})
I persist my results in a file, and this is what they look like:
2019-08-21 06-AM;15310.444858356934;0.50021994;
2019-08-21 12-PM;14287.717187390663;0.46680558;
2019-08-21 06-PM;14104.63871795706;0.46082407;
For example, the last prediction is 0.46, but when I try to convert it I get 14104, whereas it should be nearer to a 10000 value.
Does anyone have an idea how to convert those predictions?
Thanks!
You will have to use the inverse_transform method of MinMaxScaler to convert the output you are getting in the 0-1 range back to the original scale.
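For example, a minimal sketch based on the scaler setup in the question (the target is column 0 of the matrix the scaler was fit on, so rebuild a full-width matrix before inverting):
import numpy as np

p = np.ravel(pred)  # the 0-1 network outputs, flattened
scaled = np.zeros((p.size, data_test.shape[1]))  # data_test has the width the scaler was fit on
scaled[:, 0] = p
unscaled_pred = scaler.inverse_transform(scaled)[:, 0]  # back to the price scale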
You have not shown your model, but I believe you are working on a regression task with a few dense layers. You will have to keep minimizing your loss; if you are using mean squared error, the larger the loss, the more likely your output is to be far from the desired results.
Even if your loss is small and the results look good for the training samples, if the predictions are bad on the test dataset you may have to consider increasing your training dataset so that more possibilities are covered. If that is not possible, consider reducing the number of neurons in your neural network so that it stops over-fitting.
You can do some postprocessing to restrict the output to some desired range.
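For instance, continuing from the sketch above (the bounds here are purely illustrative, not from the thread):
import numpy as np

# Clip the de-normalized predictions to a plausible price band
unscaled_pred = np.clip(unscaled_pred, 9000.0, 11000.0)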
I'm trying to write a multi-class perceptron algorithm for the MNIST dataset.
I have the following code, which works, but because it iterates over 60k samples per epoch it runs slowly.
weights has shape (785, 10).
def multiClassPLA(train_data, train_labels, weights):
    epoch_err = []  # will hold the misclassified ratio for each epoch
    best_weights = weights
    best_error = 1
    for epoch in range(EPOCH):
        err = 0
        # randomize the data before each epoch
        train_data, train_labels = randomizeData(train_data, train_labels)
        for x, y in zip(train_data, train_labels):
            h = oneVsAllLabeling_(np.dot(weights, x))
            diff = (y - h) / 2
            x = x.reshape(1, x.shape[0])
            diff = diff.reshape(CLASSES, 1)
            update_step = ETA * np.dot(diff, x)
            weights += update_step
    return weights
The oneVsAllLabeling_(X) function returns a vector that contains 1 at the argmax and -1 elsewhere; the truth labels have the same form, of course.
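For reference, a minimal sketch of what such a function could look like (the actual implementation is not shown in the question):
import numpy as np

def oneVsAllLabeling_(scores):
    # 1 at the argmax, -1 everywhere else
    out = -np.ones_like(scores)
    out[np.argmax(scores)] = 1
    return out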
With this algorithm I'm getting ~90% accuracy: safe, but slow.
After further exploration of the problem, I found that I can improve the code using array/matrix multiplication.
So I've started to do the following:
def oneVsAllLabeling(X):
    idx = np.argmax(X, axis=1)
    mask = np.zeros(X.shape, dtype=bool)
    mask[np.arange(len(idx)), idx] = 1
    out = 2 * mask - 1
    return out.astype(int)

def zeroOneError(prediction):
    tester = np.zeros((1, CLASSES))
    good_prediction = len(np.where(prediction == tester))
    return len(prediction) - good_prediction

def preceptronModelFitting(data, weights, labels, to_print, epoch=None):
    prediction = np.matmul(data, weights)
    prediction = oneVsAllLabeling(prediction)
    diff = (prediction - labels) / 2
    error = zeroOneError(diff)
    accuracy = error / len(data)
    if to_print:
        print("Epoch: {}. Loss: {}. Accuracy: {}".format(epoch, error, accuracy))
    return prediction, error, accuracy

def multiClassPLA2(train_data, train_labels, test_data, test_labels, weights):
    predicted_output = np.zeros((1, CLASSES))
    train_loss_vec = np.array([])
    train_accuracy_vec = np.array([])
    test_loss_vec = np.array([])
    test_accuracy_vec = np.array([])
    for epoch in range(EPOCH):
        # randomize the data before each epoch
        train_data, train_labels = randomizeData(train_data, train_labels)
        train_prediction, train_error, train_accuracy = preceptronModelFitting(
            train_data, weights, train_labels, to_print=False)
    return weights
After calling preceptronModelFitting() I get a matrix of size (60k, 10) in which every row has the following shape:
train_prediction[0] = [0, 0, 1, 0, 0, -1, 0, 0, 0, 0]
and the data has shape (60k, 785).
Now what I need to do, if possible, is to multiply each row by the corresponding data entry and sum, so that in total I get a matrix of size (785, 10) with which I can update the old set of weights.
This is almost equivalent to what I do in the inefficient algorithm; the only difference is that there I update the weights at every new data entry instead of after seeing all the data.
Thanks!
OK, you've done most of the job, and you even had part of the answer in your title:
np.matmul(X.T, truth - prediction)
This will get you what you want in one line.
Note that this relies on truth, prediction, and X being shaped as you described.
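For example, plugged into the epoch loop of multiClassPLA2 (a sketch reusing ETA and the (truth - prediction) / 2 convention from the question's code):
# One vectorized update over all samples at once:
# train_data.T is (785, 60k) and diff is (60k, 10), so the update is (785, 10)
diff = (train_labels - train_prediction) / 2
weights += ETA * np.matmul(train_data.T, diff)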