I want to train a neural network using gradient descent on batches that contain N training points each. I would like these batches to only contain points with the same label, instead of being randomly sampled from the training set.
For example, if I'm training using MNIST, I would like to have batches that look like the following:
batch_1 = {0,0,0,0,0,0,0,0}
batch_2 = {3,3,3,3,3,3,3,3}
batch_3 = {7,7,7,7,7,7,7,7}
.....
and so on.
How can I do this using PyTorch?
One way to do it is to create subsets and dataloaders for each class and then iterate by randomly switching between the dataloaders at each iteration:
import torch
from torch.utils.data import DataLoader, Subset
from torchvision.datasets import MNIST
from torchvision import transforms
import numpy as np

dataset = MNIST('path/to/mnist_root/',
                transform=transforms.ToTensor(),
                download=True)

class_inds = [torch.where(dataset.targets == class_idx)[0]
              for class_idx in dataset.class_to_idx.values()]

dataloaders = [
    DataLoader(
        dataset=Subset(dataset, inds),
        batch_size=8,
        shuffle=True,
        drop_last=False)
    for inds in class_inds]

epochs = 1
for epoch in range(epochs):
    iterators = list(map(iter, dataloaders))
    while iterators:
        iterator = np.random.choice(iterators)
        try:
            images, labels = next(iterator)
            print(labels)
            # do_more_stuff()
        except StopIteration:
            iterators.remove(iterator)
This will work with any dataset that exposes targets and class_to_idx the way the torchvision classification datasets do (not just MNIST).
Here's the result of printing the labels at each iteration:
tensor([6, 6, 6, 6, 6, 6, 6, 6])
tensor([3, 3, 3, 3, 3, 3, 3, 3])
tensor([0, 0, 0, 0, 0, 0, 0, 0])
tensor([5, 5, 5, 5, 5, 5, 5, 5])
tensor([8, 8, 8, 8, 8, 8, 8, 8])
tensor([0, 0, 0, 0, 0, 0, 0, 0])
...
tensor([1, 1, 1, 1, 1, 1, 1, 1])
tensor([1, 1, 1, 1, 1, 1])
Note that with drop_last=False, the last batch of each class may contain fewer than batch_size elements. With drop_last=True, all batches have the same size, but some data points are dropped.
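An alternative that avoids one DataLoader per class is to build the batches as lists of indices. The sketch below is framework-free and only illustrates the grouping logic; the class name SameLabelBatchSampler and the toy labels list are made up for this example and are not part of the answer above.

```python
import random
from collections import defaultdict


class SameLabelBatchSampler:
    """Yields lists of dataset indices; every list contains one label only."""

    def __init__(self, labels, batch_size, drop_last=False, seed=None):
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.rng = random.Random(seed)
        self.by_label = defaultdict(list)
        for idx, label in enumerate(labels):
            self.by_label[int(label)].append(idx)

    def _batches(self):
        batches = []
        for indices in self.by_label.values():
            indices = indices[:]  # copy so shuffling leaves the stored order alone
            self.rng.shuffle(indices)
            for start in range(0, len(indices), self.batch_size):
                batch = indices[start:start + self.batch_size]
                if len(batch) == self.batch_size or not self.drop_last:
                    batches.append(batch)
        self.rng.shuffle(batches)  # interleave the classes across iterations
        return batches

    def __iter__(self):
        return iter(self._batches())

    def __len__(self):
        total = 0
        for indices in self.by_label.values():
            full, rest = divmod(len(indices), self.batch_size)
            total += full + (1 if rest and not self.drop_last else 0)
        return total


# Toy labels (hypothetical); with MNIST this would be dataset.targets.
labels = [0, 3, 0, 7, 3, 0, 7, 7, 3, 0]
sampler = SameLabelBatchSampler(labels, batch_size=2, seed=0)
for batch in sampler:
    print([labels[i] for i in batch])
```

Since the object yields lists of indices and defines __len__, it should be usable as the batch_sampler argument of DataLoader, e.g. DataLoader(dataset, batch_sampler=sampler), though that combination is not exercised here.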
I understand why this error usually occurs: an input index is >= num_embeddings (the first argument of the Embedding).
However, in my case torch.max(inputs) = num_embeddings - 1.
print('inputs: ', src_seq)
print('input_shape: ', src_seq.shape)
print(self.src_word_emb)
inputs: tensor([[10, 6, 2, 4, 9, 14, 6, 2, 5, 0],
[12, 6, 3, 8, 13, 2, 0, 1, 1, 1],
[13, 8, 12, 7, 2, 4, 0, 1, 1, 1]])
input_shape: [3, 10]
Embedding(15, 512, padding_idx=1)
emb = self.src_word_emb(src_seq)
I am trying to get a transformer model to work, and for some reason the encoder embedding only accepts inputs < embedding_dim_decoder, which does not make sense, right?
Found the error source! In the transformer model the encoder and decoder can be set up to share the same embedding weights. However, I had a translation task with one embedding for the decoder and one embedding for the encoder. In the code it initializes the weights via:
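if emb_src_trg_weight_sharing:
    self.encoder.src_word_emb.weight = self.decoder.trg_word_emb.weight
Setting emb_src_trg_weight_sharing to False solved the issue! With sharing enabled, the encoder embedding was using the decoder's weight matrix, so source indices beyond the decoder's vocabulary size were out of range.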
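The effect can be reproduced in isolation. This is a minimal sketch with made-up sizes (15 source tokens, 10 target tokens, embedding width 8), not the actual model code. It shows that after the weight assignment the module still reports 15 embeddings while the lookup table only has 10 rows.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 15-token source vocabulary, 10-token target vocabulary.
src_word_emb = nn.Embedding(15, 8, padding_idx=1)
trg_word_emb = nn.Embedding(10, 8, padding_idx=1)

# Same kind of assignment as in the weight-sharing branch.
src_word_emb.weight = trg_word_emb.weight

reported_rows = src_word_emb.num_embeddings  # attribute fixed at construction time
actual_rows = src_word_emb.weight.shape[0]   # rows of the table used for lookups

src_seq = torch.tensor([[10, 6, 2, 4, 9, 14]])
try:
    src_word_emb(src_seq)
    lookup_failed = False
except (IndexError, RuntimeError):
    # Indices 10 and 14 do not exist in the 10-row shared table.
    lookup_failed = True

print(reported_rows, actual_rows, lookup_failed)
```

Printing the module, as in the question, shows the constructor arguments rather than the shape of the weight that is actually in use, which is why the output looked consistent.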
I have a simple X_train and Y_train data:
x_train = [
array([ 6, 1, 9, 10, 7, 7, 1, 9, 10, 3, 10, 1, 4]),
array([ 2, 8, 8, 1, 1, 4, 2, 5, 1, 2, 7, 2, 1, 1, 4, 5, 10, 4])
]
y_train = [23, 17]
Arrays are numpy arrays.
I am now trying to use the tf.data.Dataset class to load these as tensors.
Previously I did something similar successfully with the following code:
dataset = data.Dataset.from_tensor_slices((x_train, y_train))
As this input is fed into an RNN, I used expand_dims in the first layer of the model (expand_dimension is passed as a function to work around an apparent bug in TensorFlow: see https://github.com/keras-team/keras/issues/5298#issuecomment-281914537):
def expand_dimension(x):
    from tensorflow import expand_dims
    return expand_dims(x, axis=-1)

model = models.Sequential(
    [
        layers.Lambda(expand_dimension,
                      input_shape=[None]),
        layers.LSTM(units=64, activation='tanh'),
        layers.Dense(units=1)
    ]
)
This worked, but only because the arrays had equal length. In the example above, the first array has 13 numbers and the second has 18.
In this case the method above doesn't work, and the recommended approach seems to be tf.data.Dataset.from_generator.
According to the accepted answer to "How to use the Tensorflow Dataset Pipeline for Variable Length Inputs?", something like the following should work (ignoring y_train here for simplicity):
dataset = tf.data.Dataset.from_generator(lambda: x_train,
                                         tf.as_dtype(x_train[0].dtype),
                                         tf.TensorShape([None, ]))
However, the TensorFlow syntax has changed since that answer was written, and it now expects the output_signature argument (see https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator).
I've tried different approaches, but I find it hard to work out from the TensorFlow documentation what exactly output_signature should be in my case.
Any help would be much appreciated.
The short answer: you can define output_signature as follows.
import tensorflow as tf
import numpy as np

x_train = [
    np.array([ 6, 1, 9, 10, 7, 7, 1, 9, 10, 3, 10, 1, 4]),
    np.array([ 2, 8, 8, 1, 1, 4, 2, 5, 1, 2, 7, 2, 1, 1, 4, 5, 10, 4])
]
y_train = [23, 17]

dataset = tf.data.Dataset.from_generator(
    lambda: x_train,
    output_signature=tf.TensorSpec(
        [None, ],
        dtype=tf.as_dtype(x_train[0].dtype)
    )
)
I'll also expand on a few things that can improve your pipeline.
Using both inputs and labels
dataset = tf.data.Dataset.from_generator(
    lambda: zip(x_train, y_train),
    output_signature=(
        tf.TensorSpec([None, ], dtype=tf.as_dtype(x_train[0].dtype)),
        # y_train is a plain Python list and has no .dtype, so name the label dtype directly
        tf.TensorSpec([], dtype=tf.int64)
    )
)
for x in dataset:
    print(x)
This outputs:
(<tf.Tensor: shape=(13,), dtype=int64, numpy=array([ 6, 1, 9, 10, 7, 7, 1, 9, 10, 3, 10, 1, 4])>, <tf.Tensor: shape=(), dtype=int64, numpy=23>)
(<tf.Tensor: shape=(18,), dtype=int64, numpy=
array([ 2, 8, 8, 1, 1, 4, 2, 5, 1, 2, 7, 2, 1, 1, 4, 5, 10,
4])>, <tf.Tensor: shape=(), dtype=int64, numpy=17>)
Caveat: This gets slightly more complicated if you try to tf.data.Dataset.batch() the items, because the sequences have different lengths. Then you need to use RaggedTensorSpec instead of TensorSpec. Also, I haven't experimented much with feeding ragged tensors into an RNN. But I think those are out of scope for the question you've asked.
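If padding is acceptable, one way to batch the variable-length items while keeping plain TensorSpec signatures is padded_batch, which pads each sequence with zeros up to the longest one in its batch. This is a sketch of that alternative, not of the RaggedTensorSpec route mentioned above. The dtypes are pinned to int64 here as an assumption to keep the example platform-independent.

```python
import numpy as np
import tensorflow as tf

x_train = [
    np.array([6, 1, 9, 10, 7, 7, 1, 9, 10, 3, 10, 1, 4], dtype=np.int64),
    np.array([2, 8, 8, 1, 1, 4, 2, 5, 1, 2, 7, 2, 1, 1, 4, 5, 10, 4], dtype=np.int64),
]
y_train = [23, 17]

dataset = tf.data.Dataset.from_generator(
    lambda: zip(x_train, y_train),
    output_signature=(
        tf.TensorSpec([None], dtype=tf.int64),
        tf.TensorSpec([], dtype=tf.int64),
    ),
)

# Pad every sequence with zeros to the length of the longest one in the batch.
batched = dataset.padded_batch(2)

for x_batch, y_batch in batched:
    print(x_batch.shape, y_batch.shape)
```

With padding, the downstream model usually needs to ignore the padded positions, for example through masking.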
I'm writing a very simple network:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
training_data = np.array([[1, 1, 1], [2, 3, 1], [0, -1, 4], [0, 3, 0], [10, -6, 8], [-3, -12, 4]])
testing_data = np.array([6, 11, 1, 9, 10, -38])
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(units = 1, activation = tf.keras.activations.relu, input_shape = (3, )))
model.compile(optimizer = tf.keras.optimizers.RMSprop(0.001), loss = tf.keras.losses.mean_squared_error, metrics = tf.keras.metrics.mean_squared_error)
model.summary()
model.fit(training_data, testing_data, epochs = 1, verbose = 'False')
print("Training completed.")
model.predict(np.array([1, 1, 1]))
The goal is to train the weights like : aX + bY + cZ = (output)
But I get the error
ValueError: Input 0 of layer sequential_54 is incompatible with the layer: expected axis -1 of input shape to have value 3 but received input with shape [None, 1]
I can't make sense of the dimensions; there is something I'm doing wrong! Any help?
In Keras, the input shape you specify does not include the batch size. Your declaration of input_shape = (3, ) is correct, but at inference time the input still needs a leading batch dimension, so instead of np.array([1, 1, 1]) you need np.array([[1, 1, 1]]).
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
training_data = np.array([[1, 1, 1], [2, 3, 1], [0, -1, 4], [0, 3, 0], [10, -6, 8], [-3, -12, 4]])
testing_data = np.array([6, 11, 1, 9, 10, -38])
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(units = 1, activation = tf.keras.activations.relu, input_shape = (3,)))
model.compile(optimizer = tf.keras.optimizers.RMSprop(0.001), loss = tf.keras.losses.mean_squared_error, metrics = [tf.keras.metrics.mean_squared_error])
model.summary()
model.fit(training_data, testing_data, epochs = 1, verbose = 0)
print("Training completed.")
model.predict(np.array([[1, 2, 1]]))
array([[0.08026636]], dtype=float32)
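The shape difference is easy to check with NumPy alone. The sketch below shows three equivalent ways of turning a single sample of shape (3,) into a batch of one with shape (1, 3); the variable names are only for illustration.

```python
import numpy as np

sample = np.array([1, 1, 1])            # shape (3,): one sample, no batch axis

as_batch_a = sample[np.newaxis, :]      # shape (1, 3)
as_batch_b = np.expand_dims(sample, 0)  # shape (1, 3)
as_batch_c = sample.reshape(1, -1)      # shape (1, 3)

print(sample.shape, as_batch_a.shape, as_batch_b.shape, as_batch_c.shape)
```

Any of the three results can be passed to model.predict in place of np.array([[1, 1, 1]]).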
For an LSTM network, I've seen great improvements with bucketing.
I've come across the bucketing section in the TensorFlow docs (tf.contrib).
However, my network uses the tf.data.Dataset API with TFRecords, so my input pipeline looks something like this:
dataset = tf.data.TFRecordDataset(TFRECORDS_PATH)
dataset = dataset.map(_parse_function)
dataset = dataset.map(_scale_function)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.padded_batch(batch_size, padded_shapes={.....})
How can I incorporate the bucketing method into the tf.data.Dataset pipeline?
If it matters, in every record in the TFRecords file I have the sequence length saved as an integer.
Various bucketing use cases using Dataset API are explained well here.
bucket_by_sequence_length() example:
def elements_gen():
    text = [[1, 2, 3], [3, 4, 5, 6, 7], [1, 2], [8, 9, 0, 2]]
    label = [1, 2, 1, 2]
    for x, y in zip(text, label):
        yield (x, y)

def element_length_fn(x, y):
    return tf.shape(x)[0]

dataset = tf.data.Dataset.from_generator(generator=elements_gen,
                                         output_shapes=([None],[]),
                                         output_types=(tf.int32, tf.int32))

dataset = dataset.apply(tf.contrib.data.bucket_by_sequence_length(element_length_func=element_length_fn,
                                                                  bucket_batch_sizes=[2, 2, 2],
                                                                  bucket_boundaries=[0, 8]))

batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    for _ in range(2):
        print('Get_next:')
        print(sess.run(batch))
Output:
Get_next:
(array([[1, 2, 3, 0, 0],
[3, 4, 5, 6, 7]], dtype=int32), array([1, 2], dtype=int32))
Get_next:
(array([[1, 2, 0, 0],
[8, 9, 0, 2]], dtype=int32), array([1, 2], dtype=int32))
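To see what the bucketing step does to the data, here is a plain-Python sketch of the idea: each element goes into a bucket chosen by its length, and a bucket is emitted as a zero-padded batch once it is full. It is a simplified illustration, not TensorFlow's implementation. The helper names bucket_batches and pad_batch are made up, and a single batch_size is used for all buckets instead of one size per bucket.

```python
from bisect import bisect_right
from collections import defaultdict


def pad_batch(batch, pad_value):
    """Pad all sequences in a batch to the length of the longest one."""
    longest = max(len(seq) for seq, _ in batch)
    padded = [list(seq) + [pad_value] * (longest - len(seq)) for seq, _ in batch]
    labels = [label for _, label in batch]
    return padded, labels


def bucket_batches(examples, bucket_boundaries, batch_size, pad_value=0):
    """Group (sequence, label) pairs by length bucket and yield padded batches."""
    pending = defaultdict(list)
    for seq, label in examples:
        # Bucket i collects lengths in [boundaries[i-1], boundaries[i]).
        bucket = bisect_right(bucket_boundaries, len(seq))
        pending[bucket].append((seq, label))
        if len(pending[bucket]) == batch_size:
            yield pad_batch(pending.pop(bucket), pad_value)
    # Flush buckets that are only partially filled when the data ends.
    for bucket in sorted(pending):
        yield pad_batch(pending[bucket], pad_value)


text = [[1, 2, 3], [3, 4, 5, 6, 7], [1, 2], [8, 9, 0, 2]]
label = [1, 2, 1, 2]
batches = list(bucket_batches(zip(text, label), bucket_boundaries=[0, 8], batch_size=2))
for padded, labels in batches:
    print(padded, labels)
```

With boundaries [0, 8], all four sequences fall into the same bucket, so the two batches match the output shown above: each batch is padded only to the longest sequence it contains.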
I'm trying to create a data input pipeline from a Tensorflow Dataset that consists of 1d tensors of numerical data. I would like to create batches of ragged tensors; I do not want to pad the data.
For instance, if my data is of the form:
[
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4]
...
]
I would like my dataset to consist of batches of the form:
<tf.Tensor [
<tf.RaggedTensor [
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4],
...]>,
<tf.RaggedTensor [
[ ... ],
...]>
]>
I've tried creating a RaggedTensor using a map but I can't seem to do it on one dimensional data.
I think this can be achieved with a little work before and after the batch.
# First, you can expand along the 0 axis for each data point
dataset = dataset.map(lambda x: tf.expand_dims(x, 0))
# Then create a RaggedTensor with a ragged rank of 1
dataset = dataset.map(lambda x: tf.RaggedTensor.from_tensor(x))
# Create batches
dataset = dataset.batch(BATCH_SIZE)
# Squeeze the extra dimension from the created batches
dataset = dataset.map(lambda x: tf.squeeze(x, axis=1))
Then the final output will be of the form:
<tf.RaggedTensor [
<tf.Tensor [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]>,
<tf.Tensor [0, 1, 2, 3, 4]>,
...
]>
for each batch.
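Depending on the TensorFlow version, a shorter route may be available: tf.data.experimental.dense_to_ragged_batch batches variable-length dense tensors directly into RaggedTensors. The sketch below assumes a TensorFlow 2.x release in which this transformation is present (newer releases mark it as deprecated in favour of Dataset.ragged_batch). The last two rows of the toy data are made up for illustration.

```python
import tensorflow as tf

# The first two rows come from the question; the last two are hypothetical.
data = [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4], [7, 8], [1, 2, 3]]

dataset = tf.data.Dataset.from_generator(
    lambda: data,
    output_signature=tf.TensorSpec([None], dtype=tf.int32),
)

# Batch variable-length rows into RaggedTensors, without padding.
dataset = dataset.apply(tf.data.experimental.dense_to_ragged_batch(batch_size=2))

ragged_batches = [batch.to_list() for batch in dataset]
print(ragged_batches)
```

Each element of the resulting dataset is a RaggedTensor with one row per input sequence, so no expand/squeeze steps are needed.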