TensorFlow's new set of input pipeline functions lets you group sets of records together using the "group_by_window" function. It is described in the documentation here:
https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset#group_by_window
I don't fully understand the explanation here used to describe the function, and I tend to learn best by example. I can't find any example code anywhere on the internet for this function. Could someone please whip up a barebones and runnable example of this function to show how it works, and what to give this function?
For tensorflow version 1.9.0
Here is a quick example I could come up with:
import tensorflow as tf
import numpy as np
components = np.arange(100).astype(np.int64)
dataset = tf.data.Dataset.from_tensor_slices(components)
dataset = dataset.apply(tf.contrib.data.group_by_window(key_func=lambda x: x % 2, reduce_func=lambda _, els: els.batch(10), window_size=100))
iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()
sess = tf.Session()
sess.run(features) # array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18], dtype=int64)
The first argument key_func maps every element in the dataset to a key.
The window_size defines the bucket size that is given to the reduce_func.
In the reduce_func you receive a block of at most window_size elements that share the same key. You can shuffle, batch or pad however you want.
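To make the roles of the three arguments concrete, here is a minimal plain-Python model of the grouping behaviour described above. It is a sketch, not TensorFlow code: the helper names `group_by_window` and `batch` are stand-ins written for this illustration, and the order in which leftover windows are flushed at the end of the input is an assumption.

```python
from collections import OrderedDict

def group_by_window(elements, key_func, reduce_func, window_size):
    """Plain-Python model: bucket elements by key, emit a bucket when it is full."""
    windows = OrderedDict()  # key -> elements collected so far
    for element in elements:
        key = key_func(element)
        windows.setdefault(key, []).append(element)
        if len(windows[key]) == window_size:
            # A full window is handed to reduce_func, whose results are emitted.
            yield from reduce_func(key, windows.pop(key))
    # Assumption of this sketch: leftover partial windows are flushed at the
    # end of the input, in the order their keys were first seen.
    for key, window in windows.items():
        yield from reduce_func(key, window)

def batch(window, size):
    """Split a window into consecutive batches of at most `size` elements."""
    return [window[i:i + size] for i in range(0, len(window), size)]

batches = list(group_by_window(range(100),
                               key_func=lambda x: x % 2,
                               reduce_func=lambda _, els: batch(els, 10),
                               window_size=100))
print(batches[0])  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

With only 50 elements per key and window_size=100, no window ever fills, so everything is emitted by the final flush: first the batches of even numbers, then the batches of odd numbers.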
EDIT: dynamic padding and bucketing using the group_by_window function:
If you have a tf.contrib.data Dataset whose elements are (sequence_length, label, sequence), where sequence is a tensor of tf.int64:
def bucketing_fn(sequence_length, buckets):
    """Given a sequence_length returns a bucket id"""
    t = tf.clip_by_value(buckets, 0, sequence_length)
    return tf.argmax(t)
def reduc_fn(key, elements, window_size):
    """Receives `window_size` elements"""
    return elements.shuffle(window_size, seed=0)
# Create bucket boundaries from 0 up to (but not including) 500 in steps of 15 -> [0, 15, 30, ... , 495]
buckets = [tf.constant(num, dtype=tf.int64) for num in range(0, 500, 15)]
window_size = 1000
# Bucketing
dataset = dataset.group_by_window(
lambda x, y, z: bucketing_fn(x, buckets),
lambda key, x: reduc_fn(key, x, window_size), window_size)
# You could pad it in the reduc_func, but I'll do it here for clarity
# The last element of the dataset is the variable-length sentence. Giving it tf.Dimension(None) pads the sentences (with 0) to the length of the longest sentence in the batch.
dataset = dataset.padded_batch(batch_size, padded_shapes=(
tf.TensorShape([]), tf.TensorShape([]), tf.Dimension(None)))
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()
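The arithmetic behind the bucketing and padding can be checked in plain Python. This is a sketch under my reading of the code above: clipping the boundaries at sequence_length and taking the argmax selects the first boundary that is greater than or equal to the length, capped at the last bucket. The helper names `bucket_id` and `pad_batch` are made up for this illustration.

```python
import bisect

def bucket_id(sequence_length, buckets):
    """Index of the first boundary >= sequence_length, capped at the last bucket."""
    return min(bisect.bisect_left(buckets, sequence_length), len(buckets) - 1)

def pad_batch(sequences, pad_value=0):
    """Pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

buckets = list(range(0, 500, 15))  # [0, 15, 30, ..., 495]
print(bucket_id(20, buckets))  # 2, i.e. the boundary 30
print(pad_batch([[7, 8, 9], [4], [5, 6]]))  # [[7, 8, 9], [4, 0, 0], [5, 6, 0]]
```

Because sequences of similar length land in the same bucket, padding each batch to its own longest sequence wastes far fewer zeros than padding the whole dataset to one global maximum.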
I'm trying to implement a beam search decoding strategy in a text generation model. This is the function that I am using to decode the output probabilities.
def beam_search_decoder(data, k):
    sequences = [[list(), 0.0]]
    # walk over each step in sequence
    for row in data:
        all_candidates = list()
        for i in range(len(sequences)):
            seq, score = sequences[i]
            for j in range(len(row)):
                candidate = [seq + [j], score - torch.log(row[j])]
                all_candidates.append(candidate)
        # sort candidates by score
        ordered = sorted(all_candidates, key=lambda tup: tup[1])
        sequences = ordered[:k]
    return sequences
As you can see, this function is written with a batch size of 1 in mind. Adding another loop over the batch would make the algorithm O(n^4), and it is already slow as it is. Is there any way to improve the speed of this function? My model output usually has size (32, 150, 9907), in the format (batch_size, max_len, vocab_size).
Below is my implementation, which may be a little bit faster than the for loop implementation.
import torch
def beam_search_decoder(post, k):
    """Beam Search Decoder
    Parameters:
        post(Tensor) – the posterior of network.
        k(int) – beam size of decoder.
    Outputs:
        indices(Tensor) – a beam of index sequence.
        log_prob(Tensor) – a beam of log likelihood of sequence.
    Shape:
        post: (batch_size, seq_length, vocab_size).
        indices: (batch_size, beam_size, seq_length).
        log_prob: (batch_size, beam_size).
    Examples:
        >>> post = torch.softmax(torch.randn([32, 20, 1000]), -1)
        >>> indices, log_prob = beam_search_decoder(post, 3)
    """
    batch_size, seq_length, vocab_size = post.shape
    log_post = post.log()
    log_prob, indices = log_post[:, 0, :].topk(k, sorted=True)
    indices = indices.unsqueeze(-1)
    for i in range(1, seq_length):
        # Score every (beam, token) extension: shape (batch_size, k, vocab_size)
        log_prob = log_prob.unsqueeze(-1) + log_post[:, i, :].unsqueeze(1).repeat(1, k, 1)
        log_prob, index = log_prob.view(batch_size, -1).topk(k, sorted=True)
        # index is flat over k * vocab_size: split it into parent beam and token id
        beam = torch.div(index, vocab_size, rounding_mode='floor')
        token = index % vocab_size
        # Reorder the kept prefixes by parent beam, then append the new token
        prefix = indices.gather(1, beam.unsqueeze(-1).expand(-1, -1, indices.size(-1)))
        indices = torch.cat([prefix, token.unsqueeze(-1)], dim=-1)
    return indices, log_prob
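In the loop above, `index` comes from a top-k over the flattened (k × vocab_size) axis, so each value encodes both a parent beam and a token id. The following plain-Python sketch (single example, no batching, helper names made up for this illustration) shows that decoding with `divmod` and checks the result against an exhaustive search. It assumes, as in the code above, that each step's probabilities do not depend on earlier choices.

```python
import itertools
import math

def beam_search(probs, k):
    """Beam search over per-step probabilities for a single example.

    probs: list of rows, one per time step, each a list of token probabilities.
    Returns a list of (tokens, log_prob) pairs, best first.
    """
    vocab_size = len(probs[0])
    # Start from the k best first tokens.
    first = sorted(range(vocab_size), key=lambda t: -math.log(probs[0][t]))[:k]
    beams = [([t], math.log(probs[0][t])) for t in first]
    for row in probs[1:]:
        # Flatten all (beam, token) extensions into one list of scores.
        flat = [score + math.log(p) for _, score in beams for p in row]
        best = sorted(range(len(flat)), key=lambda i: -flat[i])[:k]
        new_beams = []
        for i in best:
            beam, token = divmod(i, vocab_size)  # decode the flat index
            new_beams.append((beams[beam][0] + [token], flat[i]))
        beams = new_beams
    return beams

def brute_force(probs, k):
    """Score every possible sequence and keep the k best (tiny inputs only)."""
    vocab_size = len(probs[0])
    scored = []
    for tokens in itertools.product(range(vocab_size), repeat=len(probs)):
        score = sum(math.log(probs[t][tok]) for t, tok in enumerate(tokens))
        scored.append((list(tokens), score))
    scored.sort(key=lambda pair: -pair[1])
    return scored[:k]

probs = [[0.1, 0.6, 0.3],
         [0.5, 0.2, 0.3],
         [0.25, 0.05, 0.7]]
result = beam_search(probs, 2)
print(result[0][0])  # [1, 0, 2]
```

Without the `divmod` step, the flat index would be stored as if it were a token id, and the surviving prefixes would not follow their parent beams.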
You can use this library
https://pypi.org/project/pytorch-beam-search/
It implements Beam Search, Greedy Search and sampling for PyTorch sequence models.
The following snippet implements a Transformer seq2seq model and uses it to generate predictions.
#pip install pytorch-beam-search
from pytorch_beam_search import seq2seq
# Create vocabularies
# Tokenize the way you need
source = [list("abcdefghijkl"), list("mnopqrstwxyz")]
target = [list("ABCDEFGHIJKL"), list("MNOPQRSTWXYZ")]
# An Index object represents a mapping from the vocabulary
# to integers (indices) to feed into the models
source_index = seq2seq.Index(source)
target_index = seq2seq.Index(target)
# Create tensors
X = source_index.text2tensor(source)
Y = target_index.text2tensor(target)
# X.shape == (n_source_examples, len_source_examples) == (2, 11)
# Y.shape == (n_target_examples, len_target_examples) == (2, 12)
# Create and train the model
model = seq2seq.Transformer(source_index, target_index) # just a PyTorch model
model.fit(X, Y, epochs = 100) # basic method included
# Generate new predictions
new_source = [list("new first in"), list("new second in")]
new_target = [list("new first out"), list("new second out")]
X_new = source_index.text2tensor(new_source)
Y_new = target_index.text2tensor(new_target)
loss, error_rate = model.evaluate(X_new, Y_new) # basic method included
predictions, log_probabilities = seq2seq.beam_search(model, X_new)
output = [target_index.tensor2text(p) for p in predictions]
output
I am using image_dataset_from_directory to load a very large RGB imagery dataset from disk into a Dataset. For example,
dataset = tf.keras.preprocessing.image_dataset_from_directory(
<directory>,
label_mode=None,
seed=1,
subset='training',
validation_split=0.1)
The Dataset has, say, 100000 images grouped into batches of size 32 yielding a tf.data.Dataset with spec (batch=32, width=256, height=256, channels=3)
I would like to extract patches from the images to create a new tf.data.Dataset with image spatial dimensions of, say, 64x64.
Therefore, I would like to create a new Dataset with 400000 patches still in batches of 32 with a tf.data.Dataset with spec (batch=32, width=64, height=64, channels=3)
I've looked at the window method and the extract_patches function but it's not clear from the documentation how to use them to create a new Dataset I need to start training on the patches. The window seems to be geared toward 1D tensors and the extract_patches seems to work with arrays and not with Datasets.
Any suggestions on how to accomplish this?
UPDATE:
Just to clarify my needs. I am trying to avoid manually creating the patches on disk. One, that would be untenable disk wise. Two, the patch size is not fixed. The experiments will be conducted over several patch sizes. So, I do not want to manually perform the patch creation either on disk or manually load the images in memory and perform the patching. I would prefer to have tensorflow handle the patch creation as part of the pipeline workflow to minimize disk and memory usage.
What you're looking for is tf.image.extract_patches. Here's an example:
import tensorflow as tf
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
import numpy as np
data = tfds.load('mnist', split='test', as_supervised=True)
get_patches = lambda x, y: (tf.reshape(
tf.image.extract_patches(
images=tf.expand_dims(x, 0),
sizes=[1, 14, 14, 1],
strides=[1, 14, 14, 1],
rates=[1, 1, 1, 1],
padding='VALID'), (4, 14, 14, 1)), y)
data = data.map(get_patches)
fig = plt.figure()
plt.subplots_adjust(wspace=.1, hspace=.2)
images, labels = next(iter(data))
for index, image in enumerate(images):
    ax = plt.subplot(2, 2, index + 1)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.imshow(image)
plt.show()
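To see why the reshape above yields 4 patches of 14x14 from a 28x28 image, here is a small plain-Python sketch of non-overlapping patch extraction with 'VALID'-style bounds. It only models the indexing on a nested list and is not a substitute for tf.image.extract_patches; the function name and the toy image are made up for this illustration.

```python
def extract_patches(image, size, stride):
    """Cut a 2-D image (list of rows) into size x size patches, 'VALID' style."""
    height, width = len(image), len(image[0])
    patches = []
    # Only windows that fit entirely inside the image are kept.
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patches.append([row[left:left + size] for row in image[top:top + size]])
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy image
patches = extract_patches(image, size=2, stride=2)
print(len(patches))  # 4
print(patches[1])    # [[2, 3], [6, 7]]
```

The number of patches per image is ((height - size) // stride + 1) * ((width - size) // stride + 1), so changing the patch size between experiments only changes the `sizes`, `strides` and reshape arguments in the map function.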
I believe you can use a python class generator. You can pass this generator to model.fit function if you want. I actually used it once for labels preprocessing.
I wrote the following dataset generator that loads a batch from your dataset, splits the images from the batch into multiple images based on the tile_shape parameter. If there are enough images, the next batch is returned.
In the example, I used a simple dataset from_tensor_slices for simplification. You can, of course, replace it with yours.
import tensorflow as tf
class TileDatasetGenerator:
    def __init__(self, dataset, batch_size, tile_shape):
        self.dataset_iterator = iter(dataset)
        self.batch_size = batch_size
        self.tile_shape = tile_shape
        self.image_queue = None
    def __iter__(self):
        return self
    def __next__(self):
        if self._has_queued_enough_for_batch():
            return self._dequeue_batch()
        batch = next(self.dataset_iterator)
        self._split_images(batch)
        return self.__next__()
    def _has_queued_enough_for_batch(self):
        return self.image_queue is not None and tf.shape(self.image_queue)[0] >= self.batch_size
    def _dequeue_batch(self):
        batch, remainder = tf.split(self.image_queue, [self.batch_size, -1], axis=0)
        self.image_queue = remainder
        return batch
    def _split_images(self, batch):
        batch_shape = tf.shape(batch)
        batch_splitted = tf.reshape(batch, shape=[-1, self.tile_shape[0], self.tile_shape[1], batch_shape[-1]])
        if self.image_queue is None:
            self.image_queue = batch_splitted
        else:
            self.image_queue = tf.concat([self.image_queue, batch_splitted], axis=0)
dataset = tf.data.Dataset.from_tensor_slices(tf.ones(shape=[128, 64, 64, 3]))
# batch() returns a new dataset, so the result has to be assigned
dataset = dataset.batch(32)
generator = TileDatasetGenerator(dataset, batch_size=16, tile_shape=[32, 32])
for batch in generator:
    tf.print(tf.shape(batch))
Edit:
It is possible to convert the generator to tf.data.Dataset if you want, but it requires that you add a __call__ function to the generator returning an iterator (self in this case).
new_dataset = tf.data.Dataset.from_generator(generator, output_types=tf.float32)  # float32 matches the tf.ones example data
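The reason `__call__` is needed is that from_generator expects a callable that returns an iterable. Below is a minimal plain-Python sketch of that shape, with lists standing in for tensors. The class name is made up, and as a design variation `__call__` is written here as a generator method, so each call yields a fresh pass over the data instead of returning `self`.

```python
class BatchingGenerator:
    """Re-batches an iterable of batches into batches of a fixed size."""

    def __init__(self, batches, batch_size):
        self.batches = batches
        self.batch_size = batch_size

    def __call__(self):
        # Calling the object starts a new pass with an empty queue.
        queue = []
        for batch in self.batches:
            queue.extend(batch)
            while len(queue) >= self.batch_size:
                yield queue[:self.batch_size]
                queue = queue[self.batch_size:]
        # Items left in the queue (fewer than batch_size) are dropped.

generator = BatchingGenerator([[1, 2, 3], [4, 5, 6], [7, 8]], batch_size=4)
print(list(generator()))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Because every call builds its own queue, the object can be iterated for several epochs, which a generator that returns `self` cannot do once its underlying iterator is exhausted.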
I have a Keras model with two inputs of different shapes. One side takes in a few categorical features, while the other takes multiple time series of length PAST_HISTORY. The output is also multiple time series:
# Categorical data input
input_ct = keras.Input(shape=(len(categ_cols),),
name='categorical_input')
# Timeseries input
input_ts = keras.Input(shape=(PAST_HISTORY, len(series_cols)),
name='timeseries_input')
...
model = keras.models.Model(inputs=[input_ct, input_ts], outputs=outputs)
I created a Dataset for each input and for the output using a pandas DataFrame and some tf.data.Dataset operations.
df_ts = df[series_cols][:-FUTURE_TARGET]
ts_batch = lambda window: window.batch(PAST_HISTORY)
time_series_data = tf.data.Dataset.from_tensor_slices(df_ts)\
.window(PAST_HISTORY, 1, 1, True)\
.flat_map(ts_batch)
df_cat = df[categ_cols][PAST_HISTORY - 1:-FUTURE_TARGET]
date_data = tf.data.Dataset.from_tensor_slices(df_cat)
df_target = df[target_cols][PAST_HISTORY:]
target_batch = lambda window: window.batch(FUTURE_TARGET)
target_data = tf.data.Dataset.from_tensor_slices(df_target)\
.window(FUTURE_TARGET, 1, 1, True)\
.flat_map(target_batch)
To create the final Dataset I used a generator:
def generator():
    for d1, d2, t in zip(date_data, time_series_data, target_data):
        yield {"categorical_input": d1, "timeseries_input": d2}, tf.transpose(t)
dataset = tf.data.Dataset.from_generator(generator,
output_types=(
{'categorical_input': tf.int64, 'timeseries_input': tf.float64},
tf.float64),
output_shapes=(
{'categorical_input': (len(categ_cols),),'timeseries_input': (PAST_HISTORY, len(series_cols))},
(len(target_cols), FUTURE_TARGET),))
This worked, and I managed to train a model under eager execution by calling model.fit. However, now that I'm trying to create an Estimator from this model, the creation of the Dataset no longer works, because it implicitly iterates over the Datasets (calling __iter__), which is not allowed in graph (lazy) mode. Specifically, the problem lies in the zip operation inside the generator.
I tried to create the same dataset without the generator with the following code:
dataset = tf.data.Dataset.from_tensors(
({'categorical_input': date_data, 'timeseries_input': time_series_data}, target_data)
)
This gets me the following error when I try to call estimator.train:
TypeError: Failed to convert object of type <class 'tensorflow.python.data.ops.dataset_ops._NestedVariant'> to Tensor.
Contents: <tensorflow.python.data.ops.dataset_ops._NestedVariant object at 0x7f5bf84a97f0>.
Consider casting elements to a supported type.
What is the way to solve this error? Or is there another way to construct this Dataset without having to call an iterator on a Dataset?
Also, I tried to cast the Datasets and got the following error on the windowed Datasets:
TypeError: Failed to convert object of type <class 'tensorflow.python.data.ops.dataset_ops.FlatMapDataset'> to Tensor.
Contents: <FlatMapDataset shapes: (None, 2), types: tf.float64>.
Consider casting elements to a supported type.
Dummy data:
df = pd.DataFrame(data={
'ts_1': np.random.rand(10000),
'ts_2': np.random.rand(10000),
'ts_objective': np.random.rand(10000),
'cat_1': np.random.randint(1, 10 + 1, 10000),
'cat_2': np.random.randint(1, 25 + 1, 10000),
'cat_3': np.random.randint(1, 30 + 1, 10000),
'cat_4': np.random.randint(1, 50 + 1, 10000)})
categ_cols = ['cat_1', 'cat_2', 'cat_3', 'cat_4']
series_cols = ['ts_1', 'ts_2']
target_cols = ['ts_objective']
PAST_HISTORY = 24
FUTURE_TARGET = 8
You can build the dataset you need without using a generator (and much faster) using Dataset operations only:
import tensorflow as tf
date_data = ...
time_series_data = ...
target_data = ...
def data_tx(d1, d2, t):
    return {"categorical_input": d1, "timeseries_input": d2}, tf.transpose(t)
dataset = tf.data.Dataset.zip((date_data, time_series_data, target_data)).map(data_tx)
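Zipping only works if the three datasets line up element by element. The plain-Python sketch below mirrors the slicing and windowing from the question with small illustrative values (PAST_HISTORY=3, FUTURE_TARGET=2, integers standing in for DataFrame rows) to show that all three have the same length and how the first example is assembled. The transpose is omitted, and `sliding_windows` is a helper made up for this illustration.

```python
def sliding_windows(rows, size):
    """All windows of `size` consecutive rows with shift 1, dropping short tails."""
    return [rows[i:i + size] for i in range(len(rows) - size + 1)]

PAST_HISTORY, FUTURE_TARGET = 3, 2   # small values for illustration
rows = list(range(10))               # stand-in for the DataFrame rows

time_series = sliding_windows(rows[:-FUTURE_TARGET], PAST_HISTORY)
categorical = rows[PAST_HISTORY - 1:-FUTURE_TARGET]
targets = sliding_windows(rows[PAST_HISTORY:], FUTURE_TARGET)

def data_tx(d1, d2, t):
    # Same structure as the map function above, without the transpose.
    return {"categorical_input": d1, "timeseries_input": d2}, t

examples = [data_tx(*parts) for parts in zip(categorical, time_series, targets)]
print(len(examples))  # 6
print(examples[0])    # ({'categorical_input': 2, 'timeseries_input': [0, 1, 2]}, [3, 4])
```

Each example pairs the history window ending at row i with the categorical features of row i and the targets for the rows that follow it.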
The TensorFlow dataset guide says:
It is often convenient to give names to each component of an element,
for example if they represent different features of a training
example. In addition to tuples, you can use collections.namedtuple or
a dictionary mapping strings to tensors to represent a single element
of a Dataset.
dataset = tf.data.Dataset.from_tensor_slices(
{"a": tf.random_uniform([4]),
"b": tf.random_uniform([4, 100], maxval=100, dtype=tf.int32)})
print(dataset.output_types) # ==> "{'a': tf.float32, 'b': tf.int32}"
print(dataset.output_shapes) # ==> "{'a': (), 'b': (100,)}"
https://www.tensorflow.org/guide/datasets
And this is very useful in Keras. If you pass a dataset object to model.fit, the names of the components can be used to match the inputs of your Keras model. Example:
image_input = keras.Input(shape=(32, 32, 3), name='img_input')
timeseries_input = keras.Input(shape=(None, 10), name='ts_input')
x1 = layers.Conv2D(3, 3)(image_input)
x1 = layers.GlobalMaxPooling2D()(x1)
x2 = layers.Conv1D(3, 3)(timeseries_input)
x2 = layers.GlobalMaxPooling1D()(x2)
x = layers.concatenate([x1, x2])
score_output = layers.Dense(1, name='score_output')(x)
class_output = layers.Dense(5, activation='softmax', name='class_output')(x)
model = keras.Model(inputs=[image_input, timeseries_input],
outputs=[score_output, class_output])
train_dataset = tf.data.Dataset.from_tensor_slices(
({'img_input': img_data, 'ts_input': ts_data},
{'score_output': score_targets, 'class_output': class_targets}))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(64)
model.fit(train_dataset, epochs=3)
So it would be useful to be able to look up, add, and change the names of components in tf.data.Dataset objects. What is the best way to go about doing these tasks?
You can use map to modify your dataset, if that is what you are looking for. For example, to transform a plain tuple output into a dict with meaningful names:
import tensorflow as tf
# dummy example
ds_ori = tf.data.Dataset.zip((tf.data.Dataset.range(0, 10), tf.data.Dataset.range(10, 20)))
ds_renamed = ds_ori.map(lambda x, y: {'input': x, 'output': y})
batch_ori = ds_ori.make_one_shot_iterator().get_next()
batch_renamed = ds_renamed.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    print(sess.run(batch_ori))
    print(sess.run(batch_renamed))
    # (0, 10)
    # {'input': 0, 'output': 10}
While the accepted answer is good for changing the names of existing components, it does not cover adding new ones. This can be done as follows:
y_dataset = x_dataset.map(fn1)
where you can define fn1 however you want:
@tf.function
def fn1(x):
    # Use x to derive the additional columns you want (new1 and new2 are placeholders). Set their shapes as well.
    y = {}
    y.update(x)
    y['new1'] = new1
    y['new2'] = new2
    return y
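As a concrete illustration of the pattern, here is a plain-Python analogue in which a list of dicts stands in for the dataset and the built-in map stands in for Dataset.map. The component names "a", "b", "total" and "is_even" are hypothetical and chosen only for this sketch.

```python
def add_columns(element):
    """Return a copy of `element` with two derived entries added."""
    out = dict(element)                          # keep the existing components
    out["total"] = element["a"] + element["b"]   # hypothetical derived column
    out["is_even"] = element["a"] % 2 == 0       # hypothetical derived column
    return out

x_dataset = [{"a": i, "b": 10 + i} for i in range(3)]
y_dataset = list(map(add_columns, x_dataset))
print(y_dataset[1])  # {'a': 1, 'b': 11, 'total': 12, 'is_even': False}
```

Copying the element before adding keys leaves the original components untouched, which is the same reason the answer builds a new dict `y` rather than mutating `x`.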
Similar to this question, I want to build a TF dataset from a list with each element of different sizes. However, unlike the linked question, I would like to generate the dataset from the output of tf.dynamic_partition, which outputs a list of tensors.
My setup:
import tensorflow as tf
D = tf.data.Dataset # shorthand notation
x = tf.range(9) # Array to be partitioned
p = tf.constant([1,0,2,0,0,0,2,2,1]) # Defines partitions
The dataset should thus have three elements, containing [1 3 4 5], [0 8], and [2 6 7], respectively.
The direct approach fails, as expected:
dataset = D.from_tensor_slices(tf.dynamic_partition(x,p,3))
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    nl = sess.run(next_element)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shapes
of all inputs must match: values[0].shape = [4] != values[1].shape =
[2]
Next thing I tried is an application of the solution of the linked question, applying from_generator:
dataset = D.from_generator(lambda: tf.dynamic_partition(x,p,3), tf.int32, output_shapes=[None])
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    nl = sess.run(next_element)
tensorflow.python.framework.errors_impl.InvalidArgumentError:
exceptions.ValueError: setting an array element with a sequence.
How can I create a dataset with variable-sized items from the output of tf.dynamic_partition?
The from_generator approach doesn't work because it expects the generator function to yield NumPy arrays (or values convertible to them), not tensors.
One way to solve your problem is to create one dataset for each element of the partition. In your case you partition the data into 3 groups, so you would create 3 datasets and combine them with tf.data.Dataset.concatenate():
x = tf.range(9) # Array to be partitioned
p = tf.constant([1, 0, 2, 0, 0, 0, 2, 2, 1]) # Defines partitions
partition = tf.dynamic_partition(x, p, 3)
dataset = tf.data.Dataset.from_tensors(partition[0])
for i in range(1, 3):
    dataset_bis = tf.data.Dataset.from_tensors(partition[i])
    dataset = dataset.concatenate(dataset_bis)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    for i in range(3):
        nl = sess.run(next_element)
        print(nl)
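For reference, the grouping that the concatenated dataset should reproduce can be modelled in a few lines of plain Python. This is only a sketch of the semantics with lists in place of tensors; the function below is a stand-in written for this illustration, not the TensorFlow op.

```python
def dynamic_partition(data, partitions, num_partitions):
    """Plain-Python model of the grouping: item i goes to list partitions[i]."""
    out = [[] for _ in range(num_partitions)]
    for item, part in zip(data, partitions):
        out[part].append(item)
    return out

x = list(range(9))
p = [1, 0, 2, 0, 0, 0, 2, 2, 1]
parts = dynamic_partition(x, p, 3)
print(parts)  # [[1, 3, 4, 5], [0, 8], [2, 6, 7]]

def dataset():
    # One element per partition; elements may differ in length.
    for part in parts:
        yield part
```

Each partition becomes exactly one dataset element, which is why from_tensors (one element per tensor) is used above rather than from_tensor_slices (which would require equal shapes).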