SubsetRandomSampler does not always include all target labels when testing - python

For my ML project with PyTorch I am dividing my initial data set into a training and a testing set, using a custom function which ensures that all labels of the original data set exist in both the training and the testing set:
(working_indices, working_labels, testing_indices, testing_labels) = split_dataset_equally_random(
    target_labels=train_labels,
    percentage_to_split=100 - TEST_PERCENTAGE,
    random_seed=-1,
)

unique_testing_indices = set([label.detach().numpy().tolist() for label in testing_labels])
count_of_unique_testing_indices = {}
for entry in unique_testing_indices:
    count_of_unique_testing_indices[entry] = testing_labels.detach().numpy().tolist().count(entry)
testing_labels_from_indices = [train_labels[index] for index in testing_indices]

print(f"Unique testing indices: {unique_testing_indices}")
print(f"Unique testing indices from labels: {set([index.detach().numpy().tolist() for index in testing_labels_from_indices])}")
print(f"Number of unique testing indices: {count_of_unique_testing_indices}")
For my current application this gives me the output:
Unique testing indices: {0, 1, 2, 3}
Unique testing indices from labels: {0, 1, 2, 3}
Number of unique testing indices: {0: 2160, 1: 4104, 2: 3024, 3: 1080}
Now, for testing I use the following code:
test_loader = DataLoader(
    TensorDataset(feat, torch_labels),
    batch_size=self.Model.module_options["batch_size"],
    sampler=SubsetRandomSampler(testing_indices),
)
accuracy, predictions, prediction_distributions, actual_labels = self.Model.evaluate_model(
    test_data_loader=test_loader
)
with evaluate_model() being defined as
def evaluate_model(self, test_data_loader=None):
    """Evaluate the model on the given test data loader.

    Args:
        test_data_loader (DataLoader, optional): loader providing the test batches. Defaults to None.

    Returns:
        tuple: (accuracy, predictions, prediction_distributions, actuals), or
        (-1, -1, -1, -1) if no loader was given.
    """
    self.model.eval()
    if test_data_loader is not None:
        predictions, actuals = list(), list()
        for (inputs, targets) in test_data_loader:
            # print(f"{inputs}, {targets}")
            print(f"{set(targets.detach().numpy().tolist())}")
            inputs, targets = inputs.to(self.device), targets.to(self.device)
            # yhat, yhat_x = self.model(inputs).to("cpu")
            yhat, yhat_x = self.model(inputs)
            yhat = yhat.to("cpu")
            yhat_x = yhat_x.to("cpu")
            # print(yhat.detach().numpy(), yhat_x)
            yhat = yhat.detach().numpy()
            actual = targets.to("cpu").numpy()
            actual = actual.reshape((len(actual), 1))
            # yhat = yhat.round()
            predictions.append(yhat)
            actuals.append(actual)
        prediction_distributions, actuals = np.vstack(predictions), np.vstack(actuals)
        predictions = np.argmax(prediction_distributions, axis=1)
        acc = accuracy_score(actuals, predictions)
        return acc, predictions, prediction_distributions, actuals
    else:
        print("Test_data_loader is none")
        return -1, -1, -1, -1
Unfortunately, I sometimes run into the issue that the sampler only picks features corresponding to two or three of the four labels during testing, i.e. one or two labels are completely omitted for the entire run, which is then also reflected in the predictions (when plotting the labels used during testing I might only get {0, 1, 2} instead of {0, 1, 2, 3} for the entire testing run).
Why is that happening, and how can I avoid it in future sessions? The problem goes away by simply re-running the evaluation function.
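For reference, a minimal sanity check on the sampler itself (a sketch, assuming testing_indices is a plain list of integer indices and train_labels a 1-D tensor of labels): SubsetRandomSampler draws each supplied index exactly once per pass, without replacement, so one full iteration must visit every label present in the subset.
from torch.utils.data import SubsetRandomSampler

sampler = SubsetRandomSampler(testing_indices)
drawn = list(sampler)
# Every index should be drawn exactly once per pass over the sampler
assert sorted(drawn) == sorted(testing_indices)
# Should print {0, 1, 2, 3} if all labels are represented in the subset
print(set(train_labels[i].item() for i in drawn))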

Related

How to view predicted values from MultiStep tensor flow model?

I used a time-series dataframe with 9 variables, trying to predict one of them. I followed the official tutorial and got the final model, but I don't know how to view the predicted values.
# Split the data
column_indices = {name: i for i, name in enumerate(df.columns)}
n = len(df)
train_data = df[0:int(n*0.7)]
val_data = df[int(n*0.7):int(n*0.9)]
test_data = df[int(n*0.9):]
num_features = df.shape[1]

# Data windowing ...

# Train the model
class MultiStepLastBaseline(tf.keras.Model):
    def call(self, inputs):
        return tf.tile(inputs[:, -1:, :], [1, OUT_STEPS, 1])

last_baseline = MultiStepLastBaseline()
last_baseline.compile(loss=tf.losses.MeanSquaredError(),
                      metrics=[tf.metrics.MeanAbsoluteError()])

multi_val_performance = {}
multi_performance = {}
multi_val_performance['Last'] = last_baseline.evaluate(multi_window.val)
multi_performance['Last'] = last_baseline.evaluate(multi_window.test, verbose=0)
multi_window.plot(last_baseline)
Finally I got this plot:
[model output plot]
Now how can I see the predicted values for the next two days? I tried the following, but it says missing positional argument x:
MultiStepLastBaseline.predict(test_data)
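The error occurs because predict is called on the class itself, so test_data is consumed as self and x is left missing. A minimal sketch of the instance call (assuming multi_window.test is the windowed tf.data.Dataset from the tutorial; predict ignores the label component of its (inputs, labels) pairs):
# Call predict on the trained instance, not the class
predictions = last_baseline.predict(multi_window.test)
print(predictions.shape)  # expected: (num_windows, OUT_STEPS, num_features)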

Batch-wise beam search in pytorch

I'm trying to implement a beam search decoding strategy in a text generation model. This is the function that I am using to decode the output probabilities.
def beam_search_decoder(data, k):
    sequences = [[list(), 0.0]]
    # walk over each step in the sequence
    for row in data:
        all_candidates = list()
        # expand each current candidate with every vocabulary entry
        for i in range(len(sequences)):
            seq, score = sequences[i]
            for j in range(len(row)):
                candidate = [seq + [j], score - torch.log(row[j])]
                all_candidates.append(candidate)
        # sort candidates by score and keep the k best
        ordered = sorted(all_candidates, key=lambda tup: tup[1])
        sequences = ordered[:k]
    return sequences
As you can see, this function is implemented with batch_size 1 in mind. Adding another loop for the batch dimension would make the algorithm O(n^4), and it is slow as it is. Is there any way to improve its speed? My model output is usually of size (32, 150, 9907), which follows the format (batch_size, max_len, vocab_size).
Below is my implementation, which may be a little bit faster than the for loop implementation.
import torch

def beam_search_decoder(post, k):
    """Beam Search Decoder

    Parameters:
        post (Tensor) – the posterior of the network.
        k (int) – beam size of the decoder.

    Outputs:
        indices (Tensor) – a beam of index sequences.
        log_prob (Tensor) – a beam of log likelihoods of the sequences.

    Shape:
        post: (batch_size, seq_length, vocab_size).
        indices: (batch_size, beam_size, seq_length).
        log_prob: (batch_size, beam_size).

    Examples:
        >>> post = torch.softmax(torch.randn([32, 20, 1000]), -1)
        >>> indices, log_prob = beam_search_decoder(post, 3)
    """
    batch_size, seq_length, vocab_size = post.shape
    log_post = post.log()
    log_prob, indices = log_post[:, 0, :].topk(k, sorted=True)
    indices = indices.unsqueeze(-1)
    for i in range(1, seq_length):
        # add the next step's log-probabilities to every beam
        log_prob = log_prob.unsqueeze(-1) + log_post[:, i, :].unsqueeze(1).repeat(1, k, 1)
        # keep the k best candidates over the flattened (beam, vocab) axis
        log_prob, index = log_prob.view(batch_size, -1).topk(k, sorted=True)
        beam_id, token_id = index // vocab_size, index % vocab_size
        # reorder the surviving beams to match their candidates, then
        # append the new tokens (the flattened index itself is NOT a token id)
        indices = torch.gather(indices, 1, beam_id.unsqueeze(-1).expand(-1, -1, i))
        indices = torch.cat([indices, token_id.unsqueeze(-1)], dim=-1)
    return indices, log_prob
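A quick shape check for the vectorized decoder (a small usage sketch under the docstring's assumptions):
post = torch.softmax(torch.randn(2, 5, 7), dim=-1)  # (batch, seq, vocab)
indices, log_prob = beam_search_decoder(post, k=3)
print(indices.shape)   # torch.Size([2, 3, 5]) -> (batch_size, beam_size, seq_length)
print(log_prob.shape)  # torch.Size([2, 3])    -> (batch_size, beam_size)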
You can use this library
https://pypi.org/project/pytorch-beam-search/
It implements Beam Search, Greedy Search and sampling for PyTorch sequence models.
The following snippet implements a Transformer seq2seq model and uses it to generate predictions.
# pip install pytorch-beam-search
from pytorch_beam_search import seq2seq

# Create vocabularies
# Tokenize the way you need
source = [list("abcdefghijkl"), list("mnopqrstwxyz")]
target = [list("ABCDEFGHIJKL"), list("MNOPQRSTWXYZ")]
# An Index object represents a mapping from the vocabulary
# to integers (indices) to feed into the models
source_index = seq2seq.Index(source)
target_index = seq2seq.Index(target)

# Create tensors
X = source_index.text2tensor(source)
Y = target_index.text2tensor(target)
# X.shape == (n_source_examples, len_source_examples) == (2, 11)
# Y.shape == (n_target_examples, len_target_examples) == (2, 12)

# Create and train the model
model = seq2seq.Transformer(source_index, target_index)  # just a PyTorch model
model.fit(X, Y, epochs=100)  # basic method included

# Generate new predictions
new_source = [list("new first in"), list("new second in")]
new_target = [list("new first out"), list("new second out")]
X_new = source_index.text2tensor(new_source)
Y_new = target_index.text2tensor(new_target)
loss, error_rate = model.evaluate(X_new, Y_new)  # basic method included
predictions, log_probabilities = seq2seq.beam_search(model, X_new)
output = [target_index.tensor2text(p) for p in predictions]
output

Keras: How to expand validation_split to generate a third set i.e. test set?

I am using Keras with a TensorFlow backend. I am using the ImageDataGenerator with the validation_split argument to split my data into a train set and a validation set. As such, I use flow_from_directory with the subset argument set to "training" and "validation", like so:
data_generator = ImageDataGenerator(validation_split=0.3)
train_gen = data_generator.flow_from_directory(my_dir, target_size=(input_size, input_size),
                                               shuffle=False, seed=13, class_mode='categorical',
                                               batch_size=BATCH_SIZE, subset="training")
valid_gen = data_generator.flow_from_directory(my_dir, target_size=(input_size, input_size),
                                               shuffle=False, seed=13, class_mode='categorical',
                                               batch_size=32, subset="validation")
This is amazingly convenient, as it allows me to use only one directory instead of two (one for training and one for validation). Now I wonder whether it is possible to expand this process in order to generate a third set, i.e. a test set?
This is not possible out of the box. You should be able to do it with some minor modifications to the source code of ImageDataGenerator:
if subset is not None:
    if subset not in {'training', 'validation'}:  # add a third subset here
        raise ValueError('Invalid subset name:', subset,
                         '; expected "training" or "validation".')  # adjust message
    split_idx = int(len(x) * image_data_generator._validation_split)
    # you'll need two split indices here
    if subset == 'validation':
        x = x[:split_idx]
        x_misc = [np.asarray(xx[:split_idx]) for xx in x_misc]
        if y is not None:
            y = y[:split_idx]
    # elif subset == '...':  # add an extra case here
    else:
        x = x[split_idx:]
        x_misc = [np.asarray(xx[split_idx:]) for xx in x_misc]  # change slicing
        if y is not None:
            y = y[split_idx:]  # change slicing
Edit: this is how you could modify the code:
if subset is not None:
    if subset not in {'training', 'validation', 'test'}:
        raise ValueError('Invalid subset name:', subset,
                         '; expected "training", "validation" or "test".')
    split_idxs = [int(len(x) * v) for v in image_data_generator._validation_split]
    if subset == 'validation':
        x = x[:split_idxs[0]]
        x_misc = [np.asarray(xx[:split_idxs[0]]) for xx in x_misc]
        if y is not None:
            y = y[:split_idxs[0]]
    elif subset == 'test':
        x = x[split_idxs[0]:split_idxs[1]]
        x_misc = [np.asarray(xx[split_idxs[0]:split_idxs[1]]) for xx in x_misc]
        if y is not None:
            y = y[split_idxs[0]:split_idxs[1]]
    else:
        x = x[split_idxs[1]:]
        x_misc = [np.asarray(xx[split_idxs[1]:]) for xx in x_misc]
        if y is not None:
            y = y[split_idxs[1]:]
Basically, validation_split is now expected to be a tuple of two floats instead of a single float. The validation data will be the fraction of the data between 0 and validation_split[0], the test data the fraction between validation_split[0] and validation_split[1], and the training data the fraction between validation_split[1] and 1. This is how you can use it:
import keras
# keras_custom_preprocessing is how I named my directory
from keras_custom_preprocessing.image import ImageDataGenerator

generator = ImageDataGenerator(validation_split=(0.1, 0.5))
# First 10%: validation data - next 40%: test data - rest: training data
gen = generator.flow_from_directory(directory='./data/', subset='test')
# Finds 40% of the images in the dir
You will need to modify two or three additional lines in the file (there is a typecheck you will have to change), but that's it and it should work. I have the modified file; let me know if you are interested and I can host it on my GitHub.
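To make the split arithmetic concrete, a small sketch with hypothetical counts (1000 images and the tuple from above):
# Hypothetical example: 1000 images with validation_split=(0.1, 0.5)
split_idxs = [int(1000 * v) for v in (0.1, 0.5)]  # [100, 500]
# validation -> x[:100], test -> x[100:500], training -> x[500:]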

How do I use the "group_by_window" function in TensorFlow

In TensorFlow's new set of input pipeline functions, there is an ability to group sets of records together using the "group_by_window" function. It is described in the documentation here:
https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset#group_by_window
I don't fully understand the explanation used to describe the function there, and I tend to learn best by example, but I can't find any example code for this function anywhere on the internet. Could someone please whip up a barebones, runnable example to show how it works and what to give it?
For TensorFlow version 1.9.0, here is a quick example I could come up with:
import tensorflow as tf
import numpy as np

components = np.arange(100).astype(np.int64)
dataset = tf.data.Dataset.from_tensor_slices(components)
dataset = dataset.apply(tf.contrib.data.group_by_window(
    key_func=lambda x: x % 2,
    reduce_func=lambda _, els: els.batch(10),
    window_size=100))
iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()
sess = tf.Session()
sess.run(features)  # array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18], dtype=int64)
The first argument, key_func, maps every element in the dataset to a key. The window_size defines the bucket size that is handed to the reduce_func: in the reduce_func you receive a block of window_size elements, which you can shuffle, batch or pad however you want.
EDIT: dynamic padding and bucketing using the group_by_window function.
If you have a tf.contrib.data Dataset which holds (sequence, sequence_length, label), where sequence is a tensor of tf.int64:
def bucketing_fn(sequence_length, buckets):
    """Given a sequence_length returns a bucket id"""
    t = tf.clip_by_value(buckets, 0, sequence_length)
    return tf.argmax(t)

def reduc_fn(key, elements, window_size):
    """Receives `window_size` elements"""
    return elements.shuffle(window_size, seed=0)

# Create buckets from 0 to 500 with an increment of 15 -> [0, 15, 30, ..., 495]
buckets = [tf.constant(num, dtype=tf.int64) for num in range(0, 500, 15)]
window_size = 1000

# Bucketing
dataset = dataset.group_by_window(
    lambda x, y, z: bucketing_fn(x, buckets),
    lambda key, x: reduc_fn(key, x, window_size), window_size)

# You could pad in reduc_fn, but it is done here for clarity.
# The last component of the dataset is the dynamic sentences; giving it
# tf.Dimension(None) pads the sentences (with 0) to the longest sentence.
dataset = dataset.padded_batch(batch_size, padded_shapes=(
    tf.TensorShape([]), tf.TensorShape([]), tf.Dimension(None)))
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()
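For readers on a current TensorFlow release, the same even/odd example can be written without sessions; a minimal sketch, assuming a TF 2.x version where group_by_window is available directly as a Dataset method:
import numpy as np
import tensorflow as tf

components = np.arange(100).astype(np.int64)
dataset = tf.data.Dataset.from_tensor_slices(components)
dataset = dataset.group_by_window(
    key_func=lambda x: x % 2,            # group even vs. odd elements
    reduce_func=lambda _, els: els.batch(10),
    window_size=100)
for batch in dataset.take(2):            # eager execution, no Session needed
    print(batch.numpy())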

Lasagne/Theano wrong number of dimensions

I headed into Lasagne and Theano with a modified mnist.py (the primary example of Lasagne) to train a very simple XOR.
import numpy as np
import theano
import theano.tensor as T
import time
import lasagne

X_train = [[[[0, 0], [0, 1], [1, 0], [1, 1]]]]  # (1)
y_train = [[[[1, 0], [0, 1], [0, 1], [1, 0]]]]
# [0, 1, 1, 0]
X_train = np.array(X_train).astype(np.uint8)
y_train = np.array(y_train).astype(np.uint8)
print X_train.shape

X_val = X_train
y_val = y_train
X_test = X_train
y_test = y_train

def build_mlp(input_var=None):
    # This creates an MLP of two hidden layers of 800 units each, followed by
    # a softmax output layer of 10 units. It applies 20% dropout to the input
    # data and 50% dropout to the hidden layers.

    # Input layer, specifying the expected input shape of the network
    # (unspecified batchsize, 1 channel, 28 rows and 28 columns) and
    # linking it to the given Theano variable `input_var`, if any:
    l_in = lasagne.layers.InputLayer(shape=(None, 1, 4, 2),  # (2)
                                     input_var=input_var)

    # Apply 20% dropout to the input data:
    # l_in_drop = lasagne.layers.DropoutLayer(l_in, p=0.2)

    # Add a fully-connected layer of 800 units, using the linear rectifier, and
    # initializing weights with Glorot's scheme (which is the default anyway):
    l_hid1 = lasagne.layers.DenseLayer(
        l_in, num_units=4,
        nonlinearity=lasagne.nonlinearities.rectify,
        W=lasagne.init.GlorotUniform())

    # Finally, we'll add the fully-connected output layer, of 10 softmax units:
    l_out = lasagne.layers.DenseLayer(
        l_hid1, num_units=2,
        nonlinearity=lasagne.nonlinearities.softmax)

    # Each layer is linked to its incoming layer(s), so we only need to pass
    # the output layer to give access to a network in Lasagne:
    return l_out

# Prepare Theano variables for inputs and targets
input_var = T.tensor4('inputs')
target_var = T.ivector('targets')

network = build_mlp(input_var)

# Create a loss expression for training, i.e., a scalar objective we want
# to minimize (for our multi-class problem, it is the cross-entropy loss):
prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(prediction, target_var)
loss = loss.mean()
# We could add some weight decay as well here, see lasagne.regularization.

# Create update expressions for training, i.e., how to modify the
# parameters at each training step. Here, we'll use Stochastic Gradient
# Descent (SGD) with Nesterov momentum, but Lasagne offers plenty more.
params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.nesterov_momentum(
    loss, params, learning_rate=0.01, momentum=0.9)

# Create a loss expression for validation/testing. The crucial difference
# here is that we do a deterministic forward pass through the network,
# disabling dropout layers.
test_prediction = lasagne.layers.get_output(network, deterministic=True)
test_loss = lasagne.objectives.categorical_crossentropy(test_prediction,
                                                        target_var)
test_loss = test_loss.mean()
# As a bonus, also create an expression for the classification accuracy:
test_acc = T.mean(T.eq(T.argmax(test_prediction, axis=1), target_var),
                  dtype=theano.config.floatX)

# Compile a function performing a training step on a mini-batch (by giving
# the updates dictionary) and returning the corresponding training loss:
train_fn = theano.function([input_var, target_var], loss, updates=updates)

# Compile a second function computing the validation loss and accuracy:
val_fn = theano.function([input_var, target_var], [test_loss, test_acc])

# ############################# Batch iterator ###############################
# This is just a simple helper function iterating over training data in
# mini-batches of a particular size, optionally in random order. It assumes
# data is available as numpy arrays. For big datasets, you could load numpy
# arrays as memory-mapped files (np.load(..., mmap_mode='r')), or write your
# own custom data iteration function. For small datasets, you can also copy
# them to GPU at once for slightly improved performance. This would involve
# several changes in the main program, though, and is not demonstrated here.
def iterate_minibatches(inputs, targets, batchsize, shuffle=False):
    assert len(inputs) == len(targets)
    if shuffle:
        indices = np.arange(len(inputs))
        np.random.shuffle(indices)
    for start_idx in range(0, len(inputs) - batchsize + 1, batchsize):
        if shuffle:
            excerpt = indices[start_idx:start_idx + batchsize]
        else:
            excerpt = slice(start_idx, start_idx + batchsize)
        yield inputs[excerpt], targets[excerpt]
    else:
        if shuffle:
            excerpt = indices[0:len(inputs)]
        else:
            excerpt = slice(0, len(inputs))
        yield inputs[excerpt], targets[excerpt]

num_epochs = 4

# Finally, launch the training loop.
print("Starting training...")
# We iterate over epochs:
for epoch in range(num_epochs):
    # In each epoch, we do a full pass over the training data:
    train_err = 0
    train_batches = 0
    start_time = time.time()
    for batch in iterate_minibatches(X_train, y_train, 4, shuffle=True):
        inputs, targets = batch
        print inputs.shape, targets.shape, input_var.shape, input_var.ndim, inputs.ndim
        train_err += train_fn(inputs, targets)  # (3)
        train_batches += 1

    # And a full pass over the validation data:
    val_err = 0
    val_acc = 0
    val_batches = 0
    for batch in iterate_minibatches(X_val, y_val, 4, shuffle=False):
        inputs, targets = batch
        err, acc = val_fn(inputs, targets)
        val_err += err
        val_acc += acc
        val_batches += 1

    # Then we print the results for this epoch:
    print("Epoch {} of {} took {:.3f}s".format(
        epoch + 1, num_epochs, time.time() - start_time))
    print("  training loss:\t\t{:.6f}".format(train_err / train_batches))
    print("  validation loss:\t\t{:.6f}".format(val_err / val_batches))
    print("  validation accuracy:\t\t{:.2f} %".format(
        val_acc / val_batches * 100))

# After training, we compute and print the test error:
test_err = 0
test_acc = 0
test_batches = 0
for batch in iterate_minibatches(X_test, y_test, 500, shuffle=False):
    inputs, targets = batch
    err, acc = val_fn(inputs, targets)
    test_err += err
    test_acc += acc
    test_batches += 1
print("Final results:")
print("  test loss:\t\t\t{:.6f}".format(test_err / test_batches))
print("  test accuracy:\t\t{:.2f} %".format(
    test_acc / test_batches * 100))

# Optionally, you could now dump the network weights to a file like this:
# np.savez('model.npz', lasagne.layers.get_all_param_values(network))
I defined a training set at (1), modified the input to the new dimensions at (2), and get an exception at (3):
Traceback (most recent call last):
  File "test.py", line 139, in <module>
    train_err += train_fn(inputs, targets)
  File "/usr/local/lib/python2.7/site-packages/theano/compile/function_module.py", line 513, in __call__
    allow_downcast=s.allow_downcast)
  File "/usr/local/lib/python2.7/site-packages/theano/tensor/type.py", line 169, in filter
    data.shape))
TypeError: ('Bad input argument to theano function with name "test.py:91" at index 1(0-based)', 'Wrong number of dimensions: expected 1, got 4 with shape (1, 1, 4, 2).')
And I have no clue what I did wrong. When I print the dimensions (and the program output up to the exception) I get this:
(1, 1, 4, 2)
Starting training...
(1, 1, 4, 2) (1, 1, 4, 2) Shape.0 4 4
which seems to be correct. What am I doing wrong, and how must the array be shaped to make this work?
The problem is with the second input, targets. Note that the error message indicated this by saying "...at index 1(0-based)...", i.e. the second parameter.
target_var is an ivector but you're providing a 4-dimensional tensor for targets. The solution is to alter your y_train dataset so that it is 1-dimensional:
y_train = [0, 1, 1, 0]
This will cause another error because you currently assert that the first dimension of the inputs and targets should match, but changing
assert len(inputs) == len(targets)
to
assert inputs.shape[2] == len(targets)
will solve the second problem and allow the script to run successfully.
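Putting both fixes together, a minimal sketch of the corrected data setup (int32 targets, since target_var is an ivector):
import numpy as np

# inputs keep the (batch, channels, rows, columns) layout expected by (2)
X_train = np.array([[[[0, 0], [0, 1], [1, 0], [1, 1]]]]).astype(np.uint8)
# targets become a flat class-index vector, matching T.ivector('targets')
y_train = np.array([0, 1, 1, 0]).astype(np.int32)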
