Python Minibatch Dictionary Learning

I'd like to implement error tracking with dictionary learning in python, using sklearn's MiniBatchDictionaryLearning, so that I can record how the error decreases over the iterations. I have two methods to do it, neither of which really worked. Set up:
Input data X, numpy array shape (n_samples, n_features) = (298143, 300). These are patches of shape (10, 10), generated from an image of shape (642, 480, 3).
Dictionary learning parameters: No. of columns (or atoms) = 100, alpha = 2, transform algorithm = OMP, total no. of iterations = 500 (keep it small first, just as a test case)
Calculating error: After learning the dictionary, I encode the original image again based on the learnt dictionary. Since both the encoding and the original are numpy arrays of the same shape (642, 480, 3), I'm just doing elementwise Euclidean distance for now:
err = np.sqrt(np.sum((reconstruction - original)**2))
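(For reference, the reconstruct_image helper used in the code below isn't shown in the question; a minimal sketch of what such a helper could look like, assuming the 10x10x3 RGB patches were extracted with sklearn's extract_patches_2d from the (642, 480, 3) image:)
import numpy as np
from sklearn.feature_extraction.image import reconstruct_from_patches_2d

def reconstruct_image(dico, V, patches):
    # Sparse-code every patch against the current dictionary V, rebuild each patch,
    # then reassemble the image by averaging the overlapping patches.
    code = dico.transform(patches)                   # (n_samples, n_components)
    recon_patches = np.dot(code, V)                  # (n_samples, 300)
    recon_patches = recon_patches.reshape(len(patches), 10, 10, 3)
    return reconstruct_from_patches_2d(recon_patches, (642, 480, 3))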
I did a test run with these parameters, and the full fit was able to produce a pretty good reconstruction with a low error, so that's good. Now on to the two methods:
Method 1: Save the learnt dictionary every 100 iterations, and record the error. For 500 iterations, this gives us 5 runs of 100 iterations each. After each run, I compute the error, then use the currently learnt dictionary as an initialization for the next run.
from sklearn.decomposition import MiniBatchDictionaryLearning
import numpy as np

# Fit an initial dictionary, V, as a first run
dico = MiniBatchDictionaryLearning(n_components=100,
                                   alpha=2,
                                   n_iter=100,
                                   transform_algorithm='omp')
dl = dico.fit(patches)
V = dl.components_

# Now do another 4 runs.
# Note the warm restart parameter, dict_init = V.
n_runs = 4
n_iterations = 100
for i in range(n_runs):
    print("Run %s..." % i, end="")
    dico = MiniBatchDictionaryLearning(n_components=100,
                                       alpha=2,
                                       n_iter=n_iterations,
                                       transform_algorithm='omp',
                                       dict_init=V)
    dl = dico.fit(patches)
    V = dl.components_
    img_r = reconstruct_image(dico, V, patches)
    err = np.sqrt(np.sum((img - img_r)**2))
    print("Err = %s" % err)
Problem: The error isn't decreasing, and was pretty high. The dictionary wasn't learnt very well either.
Problem: The error isn't decreasing, and was pretty high. The dictionary wasn't learnt very well either.
Method 2: Cut the input data X into batches of, say, 500 samples each, and do partial fitting, using the partial_fit() method.
batch_size = 500
n_batches = X.shape[0] // batch_size
print(n_batches)  # 596

for iternum in range(n_batches):
    batch = patches[iternum*batch_size : (iternum+1)*batch_size]
    V = dico.partial_fit(batch)
Problem: this seems to take about 5000 times longer.
I'd like to know if there's a way to retrieve the error over the fitting process?

Each call to fit re-initializes the model and forgets any previous call to fit: this is the expected behavior of all estimators in scikit-learn.
I think using partial_fit in a loop is the right solution, but you should call it on small batches (as is done in the fit method; the default batch_size value is just 3) and then only compute the cost every 100 or 1000 calls to partial_fit, for instance:
batch_size = 3
n_epochs = 20
n_batches = X.shape[0] // batch_size
print(n_batches)  # 99381

n_updates = 0
for epoch in range(n_epochs):
    for i in range(n_batches):
        batch = patches[i * batch_size:(i + 1) * batch_size]
        dico.partial_fit(batch)
        n_updates += 1
        if n_updates % 100 == 0:
            img_r = reconstruct_image(dico, dico.components_, patches)
            err = np.sqrt(np.sum((img - img_r)**2))
            print("[epoch #%02d] Err = %s" % (epoch, err))

I ran into the same problem and was finally able to make the code much faster. In case it's still useful to someone, I'm adding the solution here. The catch is that while constructing the MiniBatchDictionaryLearning object we need to set n_iter to a low value (e.g., 1), so that each partial_fit does not run a single batch for too many epochs.
# Construct the dictionary-learning object. The partial fits are done later inside
# the loop; here we only specify that each partial_fit() should run just 1 epoch
# (n_iter=1) with batch_size=batch_size on the batch it is given. Otherwise, by
# default, a single partial_fit() can run up to 1000 iterations with batch_size=3
# on each batch, which makes a single call very slow. Since we control the epochs
# ourselves and restart when all the batches are done, one iteration here is enough.
# This makes the code execute fast.
batch_size = 128  # e.g.
dico = MiniBatchDictionaryLearning(n_components=100,
                                   alpha=2,
                                   n_iter=1,  # epochs per partial_fit()
                                   batch_size=batch_size,
                                   transform_algorithm='omp')
followed by @ogrisel's code:
n_epochs = 20
n_batches = X.shape[0] // batch_size

n_updates = 0
for epoch in range(n_epochs):
    for i in range(n_batches):
        batch = patches[i * batch_size:(i + 1) * batch_size]
        dico.partial_fit(batch)
        n_updates += 1
        if n_updates % 100 == 0:
            img_r = reconstruct_image(dico, dico.components_, patches)
            err = np.sqrt(np.sum((img - img_r)**2))
            print("[epoch #%02d] Err = %s" % (epoch, err))

Related

Aligning batched sliding frame timeseries data for tensorflow/keras using timeseries_dataset_from_array and TimeseriesGenerator respectively

I have multiple input features and a singular target feature that correspond 1:1 to each other's index; meaning there should be no forward-looking or backward-looking when it comes to comparing inputs to targets: input[t] <=> target[t]. Essentially, I have already time-shifted my targets backwards to their corresponding input indexes for training purposes.
Under normal operating procedures, I would use N periods worth of past data in order to predict 1 future value, N periods ahead. As the frame shifts forward in time, each respective slot is filled with the [t+N] forecast, recorded at [t].
Now, based on whatever environment I'm developing in, I will need to use either timeseries_dataset_from_array or TimeseriesGenerator to batch my data (based on system support). I need to know if the implementation I made produces batches that will do what I expect when running model.fit() in keras. I'm unsure of whether or not keras is internally shifting data during fitting that I'm unaware of that might lead to poor results.
I'm using an LSTM, potentially with the stateful argument, so I need to ensure my batches are a perfect fit, and I also wanted to ensure the batch sizes are a power of 2 (according to some posts regarding processor efficiency). I've tried implementing my own function for making this happen given a few additional assumptions regarding validation/test sizes. On the surface it appears that everything looks good, but since I'm unsure of keras' internals I don't know if I've made a blunder.
My question is whether or not I've properly aligned/batched the inputs and targets using timeseries_dataset_from_array/TimeseriesGenerator such that running model.fit() will train using losses/metrics that compare the target at time [t] with the predicted value at time [t] using inputs at time [t].
import pandas as pd
import numpy as np

use_ts_data = True
try:
    # Comment this line out if you want to test timeseries_dataset_from_array
    raise ImportError("No TDFA for you")
    from tensorflow.keras.preprocessing import timeseries_dataset_from_array as ts_data
except (ModuleNotFoundError, ImportError):
    from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator as ts_gen
    use_ts_data = False

def gp2(size):
    return np.power(2, int(np.log2((size))))

def train_validate_test_split(
    features, targets, train_size_ratio=0.5, max_batch_size=None, memory=1,
):
    def batch_size_with_buffer(buffer, available, desired, max_batch_size):
        batch_size = gp2(min(desired, max_batch_size or np.inf))
        if available < batch_size * 3 + buffer:
            # If we don't have enough records to support this batch_size, use 1 power lower
            batch_size = np.power(2, np.log(batch_size) / np.log(2) - 1)
        return int(batch_size)

    memory = max(1, memory)
    surplus = memory - 1
    test_size_ratio = 1 - train_size_ratio
    total_size = features.shape[0]
    smallest_size = int(total_size * test_size_ratio / 2)

    # Error on insufficient data
    def insufficient_data():
        raise RuntimeError(
            f"Insufficient data on which to split train/validation/test when ratio={train_size_ratio}%, nobs={total_size} and memory={memory}"
        )

    if total_size < memory + 3:
        insufficient_data()

    # Find greatest batch size that is a power of 2, that fits the smallest dataset size, and is no greater than max_batch_size
    batch_size = batch_size_with_buffer(
        surplus, total_size, smallest_size, max_batch_size
    )
    test_size = smallest_size - smallest_size % batch_size

    # Create/align the datasets
    if use_ts_data:
        index_offset = None
        start = -test_size
        X_test = features.iloc[start - surplus:]
        y_test = targets.iloc[start:]
        end = start
        start = end - test_size
        X_validation = features.iloc[start - surplus:end]
        y_validation = targets.iloc[start:end]
        end = start
        start = (total_size + end - surplus) % batch_size
        X_train = features.iloc[start:end]
        y_train = targets.iloc[start + surplus:end]
    else:
        index_offset = memory
        _features = features.shift(-1)
        start = -test_size - memory
        X_test = _features.iloc[start:]
        y_test = targets.iloc[start:]
        end = start + memory
        start = end - test_size - memory
        X_validation = _features.iloc[start:end]
        y_validation = targets.iloc[start:end]
        end = start + memory
        start = (total_size + end - memory) % batch_size
        X_train = _features.iloc[start:end]
        y_train = targets.iloc[start:end]

    # Record indexes
    test_index = y_test.index[index_offset:]
    validation_index = y_validation.index[index_offset:]
    train_index = y_train.index[index_offset:]

    if memory > X_train.shape[0] or memory > X_validation.shape[0]:
        insufficient_data()

    format_data = ts_data if use_ts_data else ts_gen
    train = format_data(X_train.values, y_train.values, memory, batch_size=batch_size)
    validation = format_data(
        X_validation.values, y_validation.values, memory, batch_size=batch_size
    )
    test = format_data(X_test.values, y_test.values, memory, batch_size=batch_size)

    # Print out the batched data for inspection
    def results(dataset, index):
        print("\n-------------------\n")
        print(f"Index:\n\n", index, "\n\n")
        last_i = len(dataset) - 1
        for i, batch in enumerate(dataset):
            inputs, targets = batch
            if i == 0:
                print(
                    f"First:\n\nInputs:\n",
                    inputs[0][-1],
                    "...",
                    inputs[-1][-1],
                    f"\n\nTargets:\n",
                    targets[0],
                    "...",
                    targets[-1],
                )
                print(inputs.shape, targets.shape, "\n\n")
            if i == last_i:
                print(
                    f"Last:\n\nInputs:\n",
                    inputs[0][-1],
                    "...",
                    inputs[-1][-1],
                    f"\n\nTargets:\n",
                    targets[0],
                    "...",
                    targets[-1],
                )
                print(inputs.shape, targets.shape, "\n\n")
        print("\n-------------------\n")

    results(train, train_index)
    results(validation, validation_index)
    results(test, test_index)

    return (
        batch_size,
        train,
        validation,
        test,
        train_index,
        validation_index,
        test_index,
    )

# inputs and targets are expected to be aligned (i.e., loss functions should subtract the predicted target#t from the actual target#t)
x = np.arange(101)
df = pd.DataFrame(index=x)
df['inputs'] = x
df['targets'] = x
batch_size, train, validation, test, train_index, validation_index, test_index = train_validate_test_split(df['inputs'], df['targets'], train_size_ratio=0.5, max_batch_size=2, memory=8)
All loss/metric functions rely on y_pred and y_true having matching indices. There's nothing special that Keras does in the background.
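If in doubt, the alignment is easy to verify on a toy series; a minimal check using TimeseriesGenerator (the same idea works for timeseries_dataset_from_array):
import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

series = np.arange(10)
gen = TimeseriesGenerator(series, series, length=3, batch_size=4)
inputs, targets = gen[0]
print(inputs[0], targets[0])  # [0 1 2] 3 -> the target is the value immediately after the window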

avoiding iteration using smart dot/matmul on large dataset

I'm trying to write a multi-class perceptron algorithm for the MNIST dataset.
I have the following code, which works, but because it iterates over all 60k samples one at a time it is slow.
weights has shape (785, 10).
def multiClassPLA(train_data, train_labels, weights):
    epoch_err = []  # will hold the misclassified ratio for each epoch
    best_weights = weights
    best_error = 1
    for epoch in range(EPOCH):
        err = 0
        # randomize the data before each epoch
        train_data, train_labels = randomizeData(train_data, train_labels)
        for x, y in zip(train_data, train_labels):
            h = oneVsAllLabeling_(np.dot(weights, x))
            diff = (y - h) / 2
            x = x.reshape(1, x.shape[0])
            diff = diff.reshape(CLASSES, 1)
            update_step = ETA * np.dot(diff, x)
            weights += update_step
    return weights
The oneVsAllLabeling_(X) function returns a vector which contains 1 at the argmax and -1 elsewhere. The truth labels have the same form, of course.
With this algorithm I'm getting ~90% accuracy; safe, but slow.
After further exploration of the problem, I found that I can improve the code using array/matrix multiplication.
So I started doing the following:
def oneVsAllLabeling(X):
    idx = np.argmax(X, axis=1)
    mask = np.zeros(X.shape, dtype=bool)
    mask[np.arange(len(idx)), idx] = 1
    out = 2 * mask - 1
    return out.astype(int)

def zeroOneError(prediction):
    tester = np.zeros((1, CLASSES))
    good_prediction = len(np.where(prediction == tester))
    return len(prediction) - good_prediction

def preceptronModelFitting(data, weights, labels, to_print, epoch=None):
    prediction = np.matmul(data, weights)
    prediction = oneVsAllLabeling(prediction)
    diff = (prediction - labels) / 2
    error = zeroOneError(diff)
    accuracy = error / len(data)
    if to_print:
        print("Epoch: {}. Loss: {}. Accuracy: {}".format(epoch, error, accuracy))
    return prediction, error, accuracy

def multiClassPLA2(train_data, train_labels, test_data, test_labels, weights):
    predicted_output = np.zeros((1, CLASSES))
    train_loss_vec = np.array([])
    train_accuracy_vec = np.array([])
    test_loss_vec = np.array([])
    test_accuracy_vec = np.array([])
    for epoch in range(EPOCH):
        # randomize the data before each epoch
        train_data, train_labels = randomizeData(train_data, train_labels)
        train_prediction, train_error, train_accuracy = preceptronModelFitting(train_data, weights, train_labels, to_print=False)
    return weights
After calling preceptronModelFitting() I get a matrix of size (60k, 10), in which every row has the following shape:
train_prediction[0]=[0,0,1,0,0,-1,0,0,0,0]
and the data has the shape (60k, 785)
Now what I need to do, if possible, is multiply each row with each of the data entries and sum, so that in total I get a matrix of size (785, 10) with which I can update the old set of weights.
This is almost equivalent to the inefficient algorithm; the only difference is that there the weights are updated after every new data entry, whereas here they would be updated only after seeing all the data.
Thanks!
OK, you've already done most of the job, and part of the answer is even in your title:
np.matmul(X.T, truth - prediction)
This gets you what you want in one line.
Notice that this relies on truth, prediction, and X having the shapes you mentioned.
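Wired into the epoch loop, the batch update might look roughly like this (a sketch, not the exact code: it assumes weights has shape (785, 10), the labels are the +1/-1 one-hot vectors described above, and ETA, EPOCH, randomizeData, oneVsAllLabeling are as in the question; the factor of 1/2 mirrors the original per-sample update):
def multiClassPLA_batch(train_data, train_labels, weights):
    for epoch in range(EPOCH):
        # randomize the data before each epoch
        train_data, train_labels = randomizeData(train_data, train_labels)
        # (60k, 785) @ (785, 10) -> scores, then snap to +1/-1 one-vs-all labels
        prediction = oneVsAllLabeling(np.matmul(train_data, weights))
        diff = (train_labels - prediction) / 2           # (60k, 10)
        # accumulate all per-sample updates in one matmul: (785, 60k) @ (60k, 10) -> (785, 10)
        weights += ETA * np.matmul(train_data.T, diff)
    return weights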

tf.data API cannot print all the batches

I am teaching myself the tf.data API. I am using the MNIST dataset for binary classification. The training x and y data are zipped together into the full train_dataset. Chained together with this zip method is the batch() dataset method; the data is batched with a batch size of 128. Since my training set size is 11623, with batch size 128 I will have 91 batches. The size of the last batch will be 103, which is fine since this is an LSTM. Additionally, I am using drop-out. When I compute batch accuracy, I turn off the drop-out.
The full code is given below:
# Ignore the warnings
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (8, 7)

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/")

Xtrain = mnist.train.images[mnist.train.labels < 2]
ytrain = mnist.train.labels[mnist.train.labels < 2]
print(Xtrain.shape)
print(ytrain.shape)

# Data parameters
num_inputs = 28
num_classes = 2
num_steps = 28

# create the training dataset
Xtrain = tf.data.Dataset.from_tensor_slices(Xtrain).map(lambda x: tf.reshape(x, (num_steps, num_inputs)))
# apply a one-hot transformation to each label for use in the neural network
ytrain = tf.data.Dataset.from_tensor_slices(ytrain).map(lambda z: tf.one_hot(z, num_classes))
# zip the x and y training data together, then batch and prefetch the data for faster consumption
train_dataset = tf.data.Dataset.zip((Xtrain, ytrain)).batch(128).prefetch(128)

iterator = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
X, y = iterator.get_next()
training_init_op = iterator.make_initializer(train_dataset)

#### model is here ####

# Network parameters
num_epochs = 2
batch_size = 128
output_keep_var = 0.5

with tf.Session() as sess:
    init.run()
    print("Initialized")
    # Training cycle
    for epoch in range(0, num_epochs):
        num_batch = 0
        print("Epoch: ", epoch)
        avg_cost = 0.
        avg_accuracy = 0
        total_batch = int(11623 / batch_size + 1)
        sess.run(training_init_op)
        while True:
            try:
                _, miniBatchCost = sess.run([trainer, loss], feed_dict={output_keep_prob: output_keep_var})
                miniBatchAccuracy = sess.run(accuracy, feed_dict={output_keep_prob: 1.0})
                print('Batch %d: loss = %.2f, acc = %.2f' % (num_batch, miniBatchCost, miniBatchAccuracy * 100))
                num_batch += 1
            except tf.errors.OutOfRangeError:
                break
When I run this code, it seems to be working and prints:
Batch 0: loss = 0.67276, acc = 0.94531
Batch 1: loss = 0.65672, acc = 0.92969
Batch 2: loss = 0.65927, acc = 0.89062
Batch 3: loss = 0.63996, acc = 0.99219
Batch 4: loss = 0.63693, acc = 0.99219
Batch 5: loss = 0.62714, acc = 0.9765
......
......
Batch 39: loss = 0.16812, acc = 0.98438
Batch 40: loss = 0.10677, acc = 0.96875
Batch 41: loss = 0.11704, acc = 0.99219
Batch 42: loss = 0.10592, acc = 0.98438
Batch 43: loss = 0.09682, acc = 0.97656
Batch 44: loss = 0.16449, acc = 1.00000
However, as one can easily see, there is something wrong. Only 45 batches are printed, not 91, and I do not know why this is happening. I have tried many things and I think I am missing something.
I could use the repeat() function, but I do not want that, because it would give me redundant observations in the last batches and I want the LSTM to handle that itself.
This is an annoying pitfall when defining a model based directly on the get_next() output of a tf.data iterator. In your loop, you have two sess.run calls, both of which will advance the iterator by one step. This means each loop iteration actually consumes two batches (and also your loss and accuracy calculations are computed on different batches).
Not entirely sure if there is a "canonical" way of fixing this, but you could either:
compute the accuracy in the same run call as the cost/training step (see the sketch below). This would mean that the accuracy calculation is also affected by the dropout mask, but since it's an approximate value based on only one batch, that shouldn't be a huge issue; or
define your model based on a placeholder instead, and in each loop iteration run the get_next op itself, then feed the resulting numpy arrays (i.e. the batch) into the loss/accuracy computations.
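A minimal sketch of the first option, assuming the trainer, loss, accuracy and output_keep_prob ops from the (elided) model above; the inner loop would then look like:
while True:
    try:
        # A single sess.run call per loop iteration: the iterator advances exactly one
        # batch, and the cost and accuracy are computed on that same batch.
        _, miniBatchCost, miniBatchAccuracy = sess.run(
            [trainer, loss, accuracy],
            feed_dict={output_keep_prob: output_keep_var})
        print('Batch %d: loss = %.2f, acc = %.2f' % (num_batch, miniBatchCost, miniBatchAccuracy * 100))
        num_batch += 1
    except tf.errors.OutOfRangeError:
        break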

For loop to evaluate accuracy doesn't execute

So I have the following numpy arrays.
X validation set, X_val: (47151, 32, 32, 1)
y validation set (labels), y_val_dummy: (47151, 5, 10)
y validation prediction set, y_pred: (47151, 5, 10)
When I run the code, it seems to take forever. Can someone suggest why? I believe it's a code efficiency problem. I can't seem to complete the process.
y_pred_list = model.predict(X_val)
correct_preds = 0
# Iterate over sample dimension
for i in range(X_val.shape[0]):
    pred_list_i = [y_pred_array[i] for y_pred in y_pred_array]
    val_list_i = [y_val_dummy[i] for y_val in y_val_dummy]
    matching_preds = [pred.argmax(-1) == val.argmax(-1) for pred, val in zip(pred_list_i, val_list_i)]
    correct_preds = int(np.all(matching_preds))
total_acc = correct_preds / float(x_val.shape[0])
Your main problem is that you're generating a massive number of very large lists for no real reason:
for i in range(X_val.shape[0]):
    # this line generates a 47151 x 5 x 10 array every time
    pred_list_i = [y_pred_array[i] for y_pred in y_pred_array]
What's happening is that iterating over an nd numpy array iterates over the slowest-varying index (i.e. the leftmost), so every list comprehension ends up operating on 47K entries.
Marginally better would be
for i in range(X_val.shape[0]):
    pred_list_i = [y_pred for y_pred in y_pred_array[i]]
    val_list_i = [y_val for y_val in y_val_dummy[i]]
    matching_preds = [pred.argmax(-1) == val.argmax(-1) for pred, val in zip(pred_list_i, val_list_i)]
    correct_preds = int(np.all(matching_preds))
But you're still copying a lot of arrays for no real purpose. The following code should do the same, without the useless copying.
correct_preds = 0.0
for pred, val in zip(y_pred_array, y_val_dummy):
    correct_preds += all(p.argmax(-1) == v.argmax(-1)
                         for p, v in zip(pred, val))
total_accuracy = correct_preds / x_val.shape[0]
This assumes that your criteria for a correct prediction is accurate.
You can probably avoid the explicit loop entirely with a couple of calls to np.argmax, but you'll have to work that out on your own.
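For what it's worth, a fully vectorized version might look like this (a sketch, assuming the correctness criterion is that all five digit positions must match):
# argmax over the class axis for every (sample, digit) pair at once -> (47151, 5) booleans,
# then require all 5 positions to match per sample.
matches = y_pred_array.argmax(-1) == y_val_dummy.argmax(-1)
correct_preds = np.all(matches, axis=1).sum()
total_accuracy = correct_preds / float(y_val_dummy.shape[0])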

Why evaluate self._initial_state when training RNN in Tensorflow

In the RNN tutorial ptb_word_lm.py, when training the RNN using run_epoch, why is it necessary to evaluate self._initial_state?
def run_epoch(session, m, data, eval_op, verbose=False):
    """Runs the model on the given data."""
    epoch_size = ((len(data) // m.batch_size) - 1) // m.num_steps
    start_time = time.time()
    costs = 0.0
    iters = 0
    state = m.initial_state.eval()
    for step, (x, y) in enumerate(reader.ptb_iterator(data, m.batch_size,
                                                      m.num_steps)):
        cost, state, _ = session.run([m.cost, m.final_state, eval_op],
                                     {m.input_data: x,
                                      m.targets: y,
                                      m.initial_state: state})
        costs += cost
        iters += m.num_steps
        if verbose and step % (epoch_size // 10) == 10:
            print("%.3f perplexity: %.3f speed: %.0f wps" %
                  (step * 1.0 / epoch_size, np.exp(costs / iters),
                   iters * m.batch_size / (time.time() - start_time)))
    return np.exp(costs / iters)
The initial state is defined as follows and is never changed during training:
self._initial_state = cell.zero_state(batch_size, tf.float32)
In the PTB example, the sentences are concatenated and split into batches (of size batch_size x num_steps). After each batch, the last state of the RNN is passed as the initial state of the next batch. This effectively allows you to train the RNN as if it were one very long chain over the entire PTB corpus (and this explains why model.final_state is evaluated and why the state is passed into m.initial_state in the feed_dict). So you see that the initial_state actually does change at every step.
At the very beginning of an epoch, we have no previous state to pass as the initial_state and so use all zeros, represented by state = m.initial_state.eval(). Perhaps it would be less confusing if there were another property called m.zero_state that you evaluated to get this initial state. You could, for example, also use a numpy array of zeros of the appropriate size and this would work just fine too. The eval is just a convenient way to get a tensor of zeros of the appropriate size (see the sketch below).
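For example (a sketch, assuming the fully-defined static shape that cell.zero_state(batch_size, tf.float32) has in this version of the tutorial):
import numpy as np

# Instead of state = m.initial_state.eval(), build the all-zeros initial state in numpy:
state = np.zeros(m.initial_state.get_shape().as_list(), dtype=np.float32)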
Hope this makes sense!
