For loop to evaluate accuracy doesn't execute - python

So I have the following numpy arrays.
X validation set, X_val: (47151, 32, 32, 1)
y validation set (labels), y_val_dummy: (47151, 5, 10)
y validation prediction set, y_pred: (47151, 5, 10)
When I run the code, it seems to take forever. Can someone suggest why? I believe it's a code efficiency problem. I can't seem to complete the process.
y_pred_array = model.predict(X_val)
correct_preds = 0
# Iterate over sample dimension
for i in range(X_val.shape[0]):
    pred_list_i = [y_pred_array[i] for y_pred in y_pred_array]
    val_list_i = [y_val_dummy[i] for y_val in y_val_dummy]
    matching_preds = [pred.argmax(-1) == val.argmax(-1) for pred, val in zip(pred_list_i, val_list_i)]
    correct_preds += int(np.all(matching_preds))
total_acc = correct_preds / float(X_val.shape[0])

Your main problem is that you're generating a massive number of very large lists for no real reason:
for i in range(X_val.shape[0]):
    # this line generates a 47151 x 5 x 10 list every time
    pred_list_i = [y_pred_array[i] for y_pred in y_pred_array]
What's happening is that iterating over an N-d numpy array iterates over the slowest-varying index (i.e. the leftmost), so every list comprehension operates on all 47K entries.
Marginally better would be
for i in range(X_val.shape[0]):
    pred_list_i = [y_pred for y_pred in y_pred_array[i]]
    val_list_i = [y_val for y_val in y_val_dummy[i]]
    matching_preds = [pred.argmax(-1) == val.argmax(-1) for pred, val in zip(pred_list_i, val_list_i)]
    correct_preds += int(np.all(matching_preds))
But you're still copying a lot of arrays for no real purpose. The following code should do the same, without the useless copying.
correct_preds = 0.0
for pred, val in zip(y_pred_array, y_val_dummy):
    correct_preds += all(p.argmax(-1) == v.argmax(-1)
                         for p, v in zip(pred, val))
total_accuracy = correct_preds / X_val.shape[0]
This assumes that your criterion for a correct prediction (all positions must match) is what you intend.
You can probably avoid the explicit loop entirely with a couple of calls to np.argmax, but you'll have to work that out on your own.
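For reference, a fully vectorized version along those lines could look like the sketch below; it assumes the same criterion as above, i.e. a sample counts as correct only when every one of the 5 positions matches.
pred_labels = y_pred_array.argmax(-1)   # shape (47151, 5)
true_labels = y_val_dummy.argmax(-1)    # shape (47151, 5)
# mean of the per-sample "all positions match" booleans
total_accuracy = np.all(pred_labels == true_labels, axis=1).mean()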

Related

Pytorch getting accuracy of train_loop doesn't work

I want to get the accuracy of the train section of my neural network, but I get this error:
correct += (prediction.argmax(1) == y).type(torch.float).item()
ValueError: only one element tensors can be converted to Python scalars
With this code:
def train_loop(dataloader, model, optimizer):
    model.train()
    size = len(dataloader.dataset)
    correct = 0, 0
    l_loss = 0
    for batch, (X, y) in enumerate(dataloader):
        prediction = model(X)
        loss = cross_entropy(prediction, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        correct += (prediction.argmax(1) == y).type(torch.float).sum().item()
        loss, current = loss.item(), batch * len(X)
        l_loss = loss
        print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")
    correct /= size
    accu = 100 * correct
    train_loss.append(l_loss)
    train_accu.append(accu)
    print(f"Accuracy: {accu:>0.1f}%")
I don't understand why it is not working, because in my test section it works perfectly fine with exactly the same line of code.
The item() function is used to convert a one-element tensor to a standard Python number, as stated in the documentation. Please make sure that the result of sum() is a one-element tensor before calling item().
x = torch.tensor([1.0, 2.0])  # a tensor containing 2 elements
x.item()
# ValueError: only one element tensors can be converted to Python scalars
Try to use this:
prediction = prediction.argmax(1)
correct = prediction.eq(y)
correct = correct.sum()
print(correct) # to check if it is a one value tensor
correct_sum += correct.item()
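One more thing worth checking in the posted train_loop (my own observation, not part of the original answer): correct = 0, 0 makes correct a tuple, so correct += ... would also fail once the item() issue is fixed. A minimal sketch of the accumulation with a plain integer counter:
correct = 0  # a single counter, not a tuple
for batch, (X, y) in enumerate(dataloader):
    prediction = model(X)
    # sum() collapses the per-sample booleans to a one-element tensor, so item() is safe
    correct += (prediction.argmax(1) == y).sum().item()
accuracy = 100 * correct / len(dataloader.dataset)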

Aligning batched sliding frame timeseries data for tensorflow/keras using timeseries_dataset_from_array and TimeseriesGenerator respectively

I have multiple input features and a singular target feature that correspond 1:1 to each other's index; meaning there should be no forward-looking or backward-looking when it comes to comparing inputs to targets: input[t] <=> target[t]. Essentially, I have already time-shifted my targets backwards to their corresponding input indexes for training purposes.
Under normal operating procedures, I would use N periods worth of past data in order to predict 1 future value, N periods ahead. As the frame shifts forward in time, each respective slot is filled with the [t+N] forecast, recorded at [t].
Now, based on whatever environment I'm developing in, I will need to use either timeseries_dataset_from_array or TimeseriesGenerator to batch my data (based on system support). I need to know if the implementation I made produces batches that will do what I expect when running model.fit() in keras. I'm unsure of whether or not keras is internally shifting data during fitting that I'm unaware of that might lead to poor results.
I'm using an LSTM, potentially with the stateful argument, so I need to ensure my batches are a perfect fit, and I also wanted to ensure the batch sizes are a power of 2 (according to some posts regarding processor efficiency). I've tried implementing my own function for making this happen, given a few additional assumptions regarding validation/test sizes. On the surface everything looks good, but since I'm unsure of Keras' internals I don't know if I've made a blunder.
My question is whether or not I've properly aligned/batched the inputs and targets using timeseries_dataset_from_array/TimeseriesGenerator such that running model.fit() will train using losses/metrics that compare the target at time [t] with the predicted value at time [t] using inputs at time [t].
import pandas as pd
import numpy as np

use_ts_data = True
try:
    # Comment this line out if you want to test timeseries_dataset_from_array
    raise ImportError("No TDFA for you")
    from tensorflow.keras.preprocessing import timeseries_dataset_from_array as ts_data
except (ModuleNotFoundError, ImportError):
    from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator as ts_gen
    use_ts_data = False

def gp2(size):
    return np.power(2, int(np.log2(size)))

def train_validate_test_split(
    features, targets, train_size_ratio=0.5, max_batch_size=None, memory=1,
):
    def batch_size_with_buffer(buffer, available, desired, max_batch_size):
        batch_size = gp2(min(desired, max_batch_size or np.inf))
        if available < batch_size * 3 + buffer:
            # If we don't have enough records to support this batch_size, use 1 power lower
            batch_size = np.power(2, np.log(batch_size) / np.log(2) - 1)
        return int(batch_size)

    memory = max(1, memory)
    surplus = memory - 1
    test_size_ratio = 1 - train_size_ratio
    total_size = features.shape[0]
    smallest_size = int(total_size * test_size_ratio / 2)

    # Error on insufficient data
    def insufficient_data():
        raise RuntimeError(
            f"Insufficient data on which to split train/validation/test when ratio={train_size_ratio}%, nobs={total_size} and memory={memory}"
        )

    if total_size < memory + 3:
        insufficient_data()

    # Find greatest batch size that is a power of 2, that fits the smallest dataset size, and is no greater than max_batch_size
    batch_size = batch_size_with_buffer(
        surplus, total_size, smallest_size, max_batch_size
    )
    test_size = smallest_size - smallest_size % batch_size

    # Create/align the datasets
    if use_ts_data:
        index_offset = None
        start = -test_size
        X_test = features.iloc[start - surplus:]
        y_test = targets.iloc[start:]
        end = start
        start = end - test_size
        X_validation = features.iloc[start - surplus:end]
        y_validation = targets.iloc[start:end]
        end = start
        start = (total_size + end - surplus) % batch_size
        X_train = features.iloc[start:end]
        y_train = targets.iloc[start + surplus:end]
    else:
        index_offset = memory
        _features = features.shift(-1)
        start = -test_size - memory
        X_test = _features.iloc[start:]
        y_test = targets.iloc[start:]
        end = start + memory
        start = end - test_size - memory
        X_validation = _features.iloc[start:end]
        y_validation = targets.iloc[start:end]
        end = start + memory
        start = (total_size + end - memory) % batch_size
        X_train = _features.iloc[start:end]
        y_train = targets.iloc[start:end]

    # Record indexes
    test_index = y_test.index[index_offset:]
    validation_index = y_validation.index[index_offset:]
    train_index = y_train.index[index_offset:]

    if memory > X_train.shape[0] or memory > X_validation.shape[0]:
        insufficient_data()

    format_data = ts_data if use_ts_data else ts_gen
    train = format_data(X_train.values, y_train.values, memory, batch_size=batch_size)
    validation = format_data(
        X_validation.values, y_validation.values, memory, batch_size=batch_size
    )
    test = format_data(X_test.values, y_test.values, memory, batch_size=batch_size)

    # Print out the batched data for inspection
    def results(dataset, index):
        print("\n-------------------\n")
        print(f"Index:\n\n", index, "\n\n")
        last_i = len(dataset) - 1
        for i, batch in enumerate(dataset):
            inputs, targets = batch
            if i == 0:
                print(
                    f"First:\n\nInputs:\n",
                    inputs[0][-1],
                    "...",
                    inputs[-1][-1],
                    f"\n\nTargets:\n",
                    targets[0],
                    "...",
                    targets[-1],
                )
                print(inputs.shape, targets.shape, "\n\n")
            if i == last_i:
                print(
                    f"Last:\n\nInputs:\n",
                    inputs[0][-1],
                    "...",
                    inputs[-1][-1],
                    f"\n\nTargets:\n",
                    targets[0],
                    "...",
                    targets[-1],
                )
                print(inputs.shape, targets.shape, "\n\n")
        print("\n-------------------\n")

    results(train, train_index)
    results(validation, validation_index)
    results(test, test_index)

    return (
        batch_size,
        train,
        validation,
        test,
        train_index,
        validation_index,
        test_index,
    )

# inputs and targets are expected to be aligned (i.e., loss functions should subtract the predicted target#t from the actual target#t)
x = np.arange(101)
df = pd.DataFrame(index=x)
df['inputs'] = x
df['targets'] = x

batch_size, train, validation, test, train_index, validation_index, test_index = train_validate_test_split(df['inputs'], df['targets'], train_size_ratio=0.5, max_batch_size=2, memory=8)
All loss/metric functions rely on y_pred and y_true having matching indices: there's nothing special that Keras does in the background.
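If you want to double-check the alignment conventions yourself, a toy sketch like the one below (my own example, not from the post; it assumes TF 2.3+ for timeseries_dataset_from_array) makes the difference between the two APIs visible: feed an integer ramp as both inputs and targets and inspect which target each window is paired with.
import numpy as np
import tensorflow as tf

data = np.arange(10)
targets = np.arange(10)

# timeseries_dataset_from_array pairs targets[i] with the window data[i : i + sequence_length]
ds = tf.keras.preprocessing.timeseries_dataset_from_array(data, targets, sequence_length=3, batch_size=4)
for x, y in ds:
    print(x.numpy(), y.numpy())  # first batch: windows [0 1 2] ... [3 4 5] paired with targets 0 ... 3

# TimeseriesGenerator pairs targets[i] with the window data[i - length : i]
gen = tf.keras.preprocessing.sequence.TimeseriesGenerator(data, targets, length=3, batch_size=4)
for x, y in gen:
    print(x, y)  # first batch: windows [0 1 2] ... [3 4 5] paired with targets 3 ... 6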

Getting x_test, y_test from generator in Keras?

For certain problems, the validation data can't be a generator, e.g.: TensorBoard histograms:
If printing histograms, validation_data must be provided, and cannot be a generator.
My current code looks like:
image_data_generator = ImageDataGenerator()
training_seq = image_data_generator.flow_from_directory(training_dir)
validation_seq = image_data_generator.flow_from_directory(validation_dir)
testing_seq = image_data_generator.flow_from_directory(testing_dir)
model = Sequential(..)
# ..
model.compile(..)
model.fit_generator(training_seq, validation_data=validation_seq, ..)
How do I provide it as validation_data=(x_test, y_test)?
Python 2.7 and Python 3.* solution:
from platform import python_version_tuple

if python_version_tuple()[0] == '3':
    xrange = range
    izip = zip
    imap = map
else:
    from itertools import izip, imap

import numpy as np

# ..
# other code as in question
# ..

x, y = izip(*(validation_seq[i] for i in xrange(len(validation_seq))))
x_val, y_val = np.vstack(x), np.vstack(y)
Or to support class_mode='binary', then:
from keras.utils import to_categorical
x_val = np.vstack(x)
y_val = np.vstack(imap(to_categorical, y))[:,0] if class_mode == 'binary' else y
Full runnable code: https://gist.github.com/AlecTaylor/7f6cc03ed6c3dd84548a039e2e0fd006
Update (22/06/2018): Read the answer provided by the OP for a concise and efficient solution. Read mine to understand what's going on.
In Python you can get all of a generator's data using:
data = [x for x in generator]
But an ImageDataGenerator does not terminate, so the approach above would not work as-is. We can, however, use the same approach with some modifications:
data = []    # store all the generated data batches
labels = []  # store all the generated label batches
max_iter = 100  # maximum number of iterations; one batch is generated per iteration; the proper value depends on batch size and size of whole data
i = 0
for d, l in validation_generator:
    data.append(d)
    labels.append(l)
    i += 1
    if i == max_iter:
        break
Now we have two lists of tensor batches. We need to reshape them into two tensors, one for the data (i.e. X) and one for the labels (i.e. y):
data = np.array(data)
data = np.reshape(data, (data.shape[0]*data.shape[1],) + data.shape[2:])
labels = np.array(labels)
labels = np.reshape(labels, (labels.shape[0]*labels.shape[1],) + labels.shape[2:])
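As a side note (my addition, not part of the original answer), np.concatenate does the same stacking in one call per list and, unlike np.array followed by reshape, it also works when the last generated batch is smaller than the others:
data = np.concatenate(data, axis=0)      # (n_samples, height, width, channels)
labels = np.concatenate(labels, axis=0)  # (n_samples, n_classes)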

Python Minibatch Dictionary Learning

I'd like to implement error tracking with dictionary learning in python, using sklearn's MiniBatchDictionaryLearning, so that I can record how the error decreases over the iterations. I have two methods to do it, neither of which really worked. Set up:
Input data X, numpy array shape (n_samples, n_features) = (298143, 300). These are patches of shape (10, 10), generated from an image of shape (642, 480, 3).
Dictionary learning parameters: No. of columns (or atoms) = 100, alpha = 2, transform algorithm = OMP, total no. of iterations = 500 (keep it small first, just as a test case)
Calculating error: After learning the dictionary, I encode the original image again based on the learnt dictionary. Since both the encoding and the original are numpy arrays of the same shape (642, 480, 3), I'm just doing elementwise Euclidean distance for now:
err = np.sqrt(np.sum((reconstruction - original)**2))
I did a test run with these parameters, and the full fit was able to produce a pretty good reconstruction with a low error, so that's good. Now on to the two methods:
Method 1: Save the learnt dictionary every 100 iterations, and record the error. For 500 iterations, this gives us 5 runs of 100 iterations each. After each run, I compute the error, then use the currently learnt dictionary as an initialization for the next run.
# Fit an initial dictionary, V, as a first run
dico = MiniBatchDictionaryLearning(n_components = 100,
                                   alpha = 2,
                                   n_iter = 100,
                                   transform_algorithm='omp')
dl = dico.fit(patches)
V = dl.components_

# Now do another 4 runs of 100 iterations each.
# Note the warm restart parameter, dict_init = V.
n_runs = 4
n_iterations = 100
for i in range(n_runs):
    print("Run %s..." % i, end = "")
    dico = MiniBatchDictionaryLearning(n_components = 100,
                                       alpha = 2,
                                       n_iter = n_iterations,
                                       transform_algorithm='omp',
                                       dict_init = V)
    dl = dico.fit(patches)
    V = dl.components_
    img_r = reconstruct_image(dico, V, patches)
    err = np.sqrt(np.sum((img - img_r)**2))
    print("Err = %s" % err)
Problem: The error isn't decreasing, and was pretty high. The dictionary wasn't learnt very well either.
Method 2: Cut the input data X into, say, 500 batches, and do partial fitting, using the partial_fit() method.
batch_size = 500
n_batches = X.shape[0] // batch_size
print(n_batches)  # 596
for iternum in range(n_batches):
    batch = patches[iternum*batch_size : (iternum+1)*batch_size]
    V = dico.partial_fit(batch)
Problem: this seems to take about 5000 times longer.
I'd like to know if there's a way to retrieve the error over the fitting process?
Each call to fit re-initializes the model and forgets any previous call to fit: this is the expected behavior of all estimators in scikit-learn.
I think using partial_fit in a loop is the right solution, but you should call it on small batches (as is done in the fit method itself; the default batch_size value is just 3) and then only compute the cost every 100 or 1000 calls to partial_fit, for instance:
batch_size = 3
n_epochs = 20
n_batches = X.shape[0] // batch_size
print(n_batches)  # 99381 with batch_size = 3

n_updates = 0
for epoch in range(n_epochs):
    for i in range(n_batches):
        batch = patches[i * batch_size:(i + 1) * batch_size]
        dico.partial_fit(batch)
        n_updates += 1
        if n_updates % 100 == 0:
            img_r = reconstruct_image(dico, dico.components_, patches)
            err = np.sqrt(np.sum((img - img_r)**2))
            print("[epoch #%02d] Err = %s" % (epoch, err))
I ran into the same problem and finally was able to make the code much faster. If it's still useful to someone, I'm adding the solution here. The catch is that while constructing the MiniBatchDictionaryLearning object we need to set n_iter to a low value (e.g., 1), so that each partial_fit does not run a single batch for too many epochs.
# Construct an initial dictionary object; the partial fits are done later inside
# the loop. Here we only specify that each partial_fit() should run just 1
# epoch (n_iter=1) with batch_size=batch_size on the current batch provided.
# (Otherwise, by default it can run up to 1000 iterations with batch_size=3 for a
# single partial_fit() on each of the batches, which makes a single run of
# partial_fit() very slow. Since we control the epochs ourselves and restart
# when all the batches are done, we need not ask for more than 1 iteration here.
# This makes the code execute fast.)
batch_size = 128  # e.g.
dico = MiniBatchDictionaryLearning(n_components = 100,
                                   alpha = 2,
                                   n_iter = 1,  # epoch per partial_fit()
                                   batch_size = batch_size,
                                   transform_algorithm='omp')
followed by @ogrisel's code:
n_updates = 0
for epoch in range(n_epochs):
    for i in range(n_batches):
        batch = patches[i * batch_size:(i + 1) * batch_size]
        dico.partial_fit(batch)
        n_updates += 1
        if n_updates % 100 == 0:
            img_r = reconstruct_image(dico, dico.components_, patches)
            err = np.sqrt(np.sum((img - img_r)**2))
            print("[epoch #%02d] Err = %s" % (epoch, err))

How to compute optimal threshold for accuracy

My classifier produces soft classifications and I wish to select an optimal threshold (that is, one that maximizes accuracy) from the results of the method on the training cases, and use this threshold to produce the hard classification. While in general the problem is relatively easy, I find it hard to optimise the code so that the computation does not last forever. Below you'll find the code that essentially recreates the optimisation procedure on some dummy data. Could you please point me into any direction which could possibly improve performance?
import numpy as np
from sklearn.metrics import accuracy_score

y_pred = np.random.rand(400000)
y_true = np.random.randint(2, size=400000)

accs = [(accuracy_score(y_true, y_pred > t), t) for t in np.unique(y_pred)]
train_acc, train_thresh = max(accs, key=lambda pair: pair[0])
I realize that I could sort both y_pred and y_true prior to the loop, and use that to my advantage when binarizing y_pred but that didn't bring much improvement (unless I did something wrong).
Any help would be much appreciated.
Sort y_pred in descending order and use Kadane's algorithm to calculate an index i such that the subarray of y_true from 0 to i has maximum sum. Your optimal threshold b is then b = (y_pred[i] + y_pred[i+1]) / 2. This will be the same output that an SVM would give you, that is, the hyperplane (or for your 1-dimensional case, a threshold) that maximizes the margin between classes.
I wrote a helper function in python:
def opt_threshold_acc(y_true, y_pred):
    A = list(zip(y_true, y_pred))
    A = sorted(A, key=lambda x: x[1])
    total = len(A)
    tp = len([1 for x in A if x[0] == 1])
    tn = 0
    th_acc = []
    for x in A:
        th = x[1]
        if x[0] == 1:
            tp -= 1
        else:
            tn += 1
        acc = (tp + tn) / total
        th_acc.append((th, acc))
    return max(th_acc, key=lambda x: x[1])
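For the 400k-element arrays in the question, the same sweep can also be fully vectorized with numpy. This is a sketch of mine, using the same convention as above: a score strictly greater than the threshold is predicted positive.
import numpy as np

def opt_threshold_acc_np(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    order = np.argsort(y_pred)                 # ascending by score
    y_sorted = y_true[order]
    # With the threshold set at the i-th sorted score, everything up to and
    # including i is predicted negative and everything after it positive.
    tn = np.cumsum(y_sorted == 0)              # true negatives at or below the threshold
    tp = y_sorted.sum() - np.cumsum(y_sorted)  # true positives above the threshold
    acc = (tp + tn) / len(y_true)
    best = np.argmax(acc)
    return y_pred[order][best], acc[best]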
