Dataset with variable-sized items from tf.dynamic_partition

Similar to this question, I want to build a TF dataset from a list whose elements have different sizes. However, unlike the linked question, I would like to generate the dataset from the output of tf.dynamic_partition, which produces a list of tensors.
My setup:
import tensorflow as tf
D = tf.data.Dataset # shorthand notation
x = tf.range(9) # Array to be partitioned
p = tf.constant([1,0,2,0,0,0,2,2,1]) # Defines partitions
The dataset should thus have three elements, containing [1 3 4 5], [0 8], and [2 6 7], respectively.
The direct approach fails, as expected:
dataset = D.from_tensor_slices(tf.dynamic_partition(x, p, 3))
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    nl = sess.run(next_element)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shapes of all inputs must match: values[0].shape = [4] != values[1].shape = [2]
The next thing I tried was to apply the solution from the linked question, using from_generator:
dataset = D.from_generator(lambda: tf.dynamic_partition(x, p, 3), tf.int32, output_shapes=[None])
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    nl = sess.run(next_element)
tensorflow.python.framework.errors_impl.InvalidArgumentError: exceptions.ValueError: setting an array element with a sequence.
How can I create a dataset with variable-sized items from the output of tf.dynamic_partition?

The from_generator approach doesn't work because from_generator expects the generator function to yield NumPy arrays, not tensors.
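For contrast, a minimal sketch of what from_generator does accept: a plain Python generator yielding NumPy arrays of varying length (the partition values are hard-coded here purely for illustration):
import numpy as np
import tensorflow as tf

def gen():
    # NumPy arrays of different lengths are fine for from_generator
    for part in [np.array([1, 3, 4, 5]), np.array([0, 8]), np.array([2, 6, 7])]:
        yield part

dataset = tf.data.Dataset.from_generator(gen, tf.int32, output_shapes=[None])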
A way to solve your problem is to create one dataset for each element of the partition. In your case you partition the data into 3 groups, so you would create 3 datasets and combine them with tf.data.Dataset.concatenate():
x = tf.range(9)  # Array to be partitioned
p = tf.constant([1, 0, 2, 0, 0, 0, 2, 2, 1])  # Defines partitions
partition = tf.dynamic_partition(x, p, 3)

dataset = tf.data.Dataset.from_tensors(partition[0])
for i in range(1, 3):
    dataset_bis = tf.data.Dataset.from_tensors(partition[i])
    dataset = dataset.concatenate(dataset_bis)

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    for i in range(3):
        nl = sess.run(next_element)
        print(nl)
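Each call to sess.run(next_element) then yields one partition, so this should print [1 3 4 5], [0 8] and [2 6 7] in turn.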

Related

Tensorflow dataset with changing batch size to compute test loss during training

I'm trying to run a training loop where I periodically determine the current average loss and print it to the console. In order to determine the loss I'd like to use a different batch size. So it goes like this:
dataset = create_dataset().shuffle(1000).repeat().batch(minibatch_size)
iterator = dataset.make_one_shot_iterator()  # using this iterator in the graph
while ...:
    session.run(...)  # perform training
    if epoch % 10 == 0:
        test_avg_loss = session.run(avg_loss)  # want a different number of items here
I want a minibatch size of 10 during training, but I'd like to test with 100 data points to obtain a better estimate of the average loss. How can I make the dataset return a different number of items here? I tried passing a placeholder to batch, but it seems unsupported. The error is:
'ValueError : Cannot capture a placeholder (name:batchSize, type:Placeholder) by value.'
I'm open to using a different code structure altogether if that seems like a better solution. I understand it is important not to pass data via feed_dict for performance reasons, so using a dataset seems like the way to go. I'm not looking for some kind of hack; I'd like to know the right way to do this.
A good solution is to use a reinitializable iterator, which lets you switch between two (or more) Datasets, typically one for training and one for validation.
The example in the documentation is actually pretty neat:
# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset.range(100).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64))
validation_dataset = tf.data.Dataset.range(50)

# A reinitializable iterator is defined by its structure. We could use the
# `output_types` and `output_shapes` properties of either `training_dataset`
# or `validation_dataset` here, because they are compatible.
iterator = tf.data.Iterator.from_structure(training_dataset.output_types,
                                           training_dataset.output_shapes)
next_element = iterator.get_next()

training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)

# Run 20 epochs in which the training dataset is traversed, followed by the
# validation dataset.
for _ in range(20):
    # Initialize an iterator over the training dataset.
    sess.run(training_init_op)
    for _ in range(100):
        sess.run(next_element)

    # Initialize an iterator over the validation dataset.
    sess.run(validation_init_op)
    for _ in range(50):
        sess.run(next_element)
Just make sure in your case that the iterator you create has an unknown batch size.
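Applied to your case, a minimal sketch (assuming the create_dataset() helper from your question): the two datasets only differ in their batch size, and since batch() leaves the batch dimension unknown, their shapes stay compatible:
training_dataset = create_dataset().shuffle(1000).repeat().batch(10)
validation_dataset = create_dataset().repeat().batch(100)

iterator = tf.data.Iterator.from_structure(training_dataset.output_types,
                                           training_dataset.output_shapes)
next_element = iterator.get_next()
training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)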
Based on your comment, you should look into a feedable iterator that can be used together with tf.placeholder to select what Iterator to use in each call to tf.Session.run, via the familiar feed_dict mechanism. It offers the same functionality as a reinitializable iterator, but it does not require you to initialize the iterator from the start of a dataset when you switch between iterators.
# Training and validation datasets
training_dataset = tf.data.Dataset.range(100).repeat().batch(100)
validation_dataset = tf.data.Dataset.range(150, 200).repeat().batch(10)

# A feedable iterator to toggle between validation and training dataset
handle = tf.placeholder(tf.string, shape=[])
iterator = tf.data.Iterator.from_string_handle(
    handle, training_dataset.output_types, training_dataset.output_shapes)
next_element = iterator.get_next()

training_iterator = training_dataset.make_one_shot_iterator()
validation_iterator = validation_dataset.make_one_shot_iterator()

with tf.Session() as sess:
    # The `Iterator.string_handle()` method returns a tensor that can be evaluated
    # and used to feed the `handle` placeholder.
    training_handle = sess.run(training_iterator.string_handle())
    validation_handle = sess.run(validation_iterator.string_handle())

    # Run 20 epochs in which the training dataset is traversed, followed by the
    # validation dataset.
    for _ in range(20):
        for _ in range(100):
            out = sess.run(next_element, feed_dict={handle: training_handle})
        for _ in range(50):
            out = sess.run(next_element, feed_dict={handle: validation_handle})
Another option is to shape your placeholder with [None, None]. Then, during evaluation and training, give your data a structure and shape it before feeding it to the network, roughly like this:
import numpy as np
import tensorflow as tf

def shape(dataset):
    # shape your data here
    return {'input': np.array(input_data), 'label': np.array(labels)}

def evaluate(model, dataset, batch_size=100):
    sess = tf.get_default_session()
    iterations = len(dataset) // batch_size
    loss = []
    for j in range(iterations):
        batch = dataset[j * batch_size:(j + 1) * batch_size]
        # shape it here before feeding it to the network
        batch = shape(batch)
        out = sess.run(model, feed_dict={input_place: batch['input'], labels: batch['label']})
        loss.append(out['loss'])
    return np.mean(loss)

def train(model, dataset, batch_size=10):
    iterations = len(dataset) // batch_size
    with tf.Session() as sess:
        for i in range(epochs):
            for j in range(iterations):
                batch = dataset[j * batch_size:(j + 1) * batch_size]
                # shape it here before feeding it to the network
                batch = shape(batch)
                out = sess.run(model, feed_dict={input_place: batch['input'], labels: batch['label']})
                print(out['loss'], out['training_accuracy'])
            print(evaluate(model, dataset))
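For completeness, the [None, None] placeholders this answer refers to might look like the following (the names input_place and labels are assumptions chosen to match the feed_dict keys in the snippet above):
input_place = tf.placeholder(tf.float32, shape=[None, None], name='input')
labels = tf.placeholder(tf.float32, shape=[None, None], name='label')
With both dimensions left as None, the same graph accepts a batch of 10 items during training and a batch of 100 items during evaluation.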

Tensorflow CSV input pipeline: Shuffle batches but preserve sequence order within each batch

I built a recurrent neural network in TensorFlow, then built a pipeline to import training data from my dataset (a CSV file) into my model. The procedure in the code section below (source: here) works perfectly.
filename_queue = tf.train.string_input_producer(["file0.csv", "file1.csv"])

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Default values, in case of empty columns. Also specifies the type of the
# decoded result.
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(
    value, record_defaults=record_defaults)
features = tf.stack([col1, col2, col3, col4])

with tf.Session() as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(1200):
        # Retrieve a single instance:
        example, label = sess.run([features, col5])

    coord.request_stop()
    coord.join(threads)
The only additional thing is that I use
data_batch = tf.train.batch(
    [features], batch_size=n_steps_mul_batch_size, capacity=capacity)
data_batch_reshaped = tf.reshape(
    data_batch, [batch_size, n_steps, feature_num])
to create batches for training in the shape [batch_size x timesteps x features].
Now my question:
I want to pass randomized/shuffled training data to my model, but at the same time preserve the sequence ordering within each batch. So the batches as a whole should be randomized/shuffled, but the sequences within each batch should preserve the original sequence order. Is there a simple way to do that?
As far as I know, there is no simple way to do it in TensorFlow.
But you can do it manually:
import numpy as np
import random

def get_next_batch(data, batch_size):
    size = data.shape[0]
    # sample batch_size sequences at random, identified by their first value
    rand_samples = random.sample(list(data[:, 0, 0]), batch_size)
    # keep the original order inside each selected sequence
    data_batch = np.array([data[i] for i in range(size) if data[i, 0, 0] in rand_samples])
    return data_batch
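If you can use the tf.data API instead of the queue-based pipeline, a hedged alternative sketch: group records into sequences first, so the records inside each sequence keep their original order, then shuffle whole sequences before forming the training batch (the file names, parsing, n_steps and batch_size are stand-ins for your pipeline):
dataset = tf.data.TextLineDataset(["file0.csv", "file1.csv"])  # parse each line as needed
dataset = dataset.batch(n_steps)             # n_steps consecutive records form one sequence
dataset = dataset.shuffle(buffer_size=1000)  # shuffles the sequences, not the records inside them
dataset = dataset.batch(batch_size)          # -> [batch_size, n_steps, ...] after parsing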

scatter_nd_add() example for sparse addition in TensorFlow

I am having difficulty applying tf.scatter_nd_add() to 2D tensors. The documentation is a bit unclear and does not contain an example of a sparse update, only of full slice updates.
My case is the following:
updates - 2D tensor of shape [None, 6]
indices - 2D tensor of shape [None, 6]
ref - 2D Variable of zeros of shape [None, 6]
It is guaranteed that updates, indices and ref will always have their first dimension equal, but the size of that dimension can be varying. The update I want to perform looks like
for i, j:
    k = indices[i][j]
    ref[i][k] += updates[i][j]
Note that indices contains duplicates. tf.scatter_nd_add(ref, indices, updates) complains about a shape mismatch, and I cannot figure out how I need to restructure the tensors in order to perform the update.
I figured it out. Each 2D entry in indices must actually specify the absolute location that will get updated in ref. This means that indices must be 3D and then the non-vectorized update looks like:
for i, j:
    r, k = indices[i][j]
    ref[r][k] += updates[i][j]
In the above question it just happens that r is always equal to i.
Here is a full TensorFlow implementation with varying shapes. For clarity, in the following example, col_indices corresponds to indices from the original question:
import tensorflow as tf
import numpy as np

updates = tf.placeholder(dtype=tf.float32, shape=[None, 6])
col_indices = tf.placeholder(dtype=tf.int32, shape=[None, 6])
row_indices = tf.cumsum(tf.ones_like(col_indices), axis=0, exclusive=True)
indices = tf.concat([tf.expand_dims(row_indices, axis=-1),
                     tf.expand_dims(col_indices, axis=-1)], axis=-1)

tmp_var = tf.Variable(0, trainable=False, dtype=tf.float32, validate_shape=False)
ref = tf.assign(tmp_var, tf.zeros_like(updates), validate_shape=False)

# This makes sure that ref is always 0 before scatter_nd_add() runs
with tf.control_dependencies([ref]):
    result = tf.scatter_nd_add(ref, indices, updates)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Create example input data
np_input = np.arange(0, 6, 1, dtype=np.int32)
np_input = np.tile(np_input[None, :], [10, 1])

res = sess.run(result, feed_dict={updates: np_input, col_indices: np_input})
print(res)
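For this example input, where each row's column indices happen to be unique, every row of the printed result should simply be [0. 1. 2. 3. 4. 5.].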

How do I use the "group_by_window" function in TensorFlow

In TensorFlow's new set of input pipeline functions, there is an ability to group sets of records together using the "group_by_window" function. It is described in the documentation here:
https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset#group_by_window
I don't fully understand the explanation here used to describe the function, and I tend to learn best by example. I can't find any example code anywhere on the internet for this function. Could someone please whip up a barebones and runnable example of this function to show how it works, and what to give this function?
For TensorFlow version 1.9.0, here is a quick example I could come up with:
import tensorflow as tf
import numpy as np

components = np.arange(100).astype(np.int64)
dataset = tf.data.Dataset.from_tensor_slices(components)
dataset = dataset.apply(tf.contrib.data.group_by_window(
    key_func=lambda x: x % 2, reduce_func=lambda _, els: els.batch(10), window_size=100))
iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()

sess = tf.Session()
sess.run(features)  # array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18], dtype=int64)
The first argument key_func maps every element in the dataset to a key.
The window_size defines the bucket size that is given to the reduce_func.
In the reduce_func you receive a block of window_size elements. You can shuffle, batch or pad however you want.
EDIT: for dynamic padding and bucketing using the group_by_window function (more here):
If you have a tf.contrib.data.Dataset which holds (sequence, sequence_length, label), where sequence is a tensor of tf.int64:
def bucketing_fn(sequence_length, buckets):
    """Given a sequence_length returns a bucket id"""
    t = tf.clip_by_value(buckets, 0, sequence_length)
    return tf.argmax(t)

def reduc_fn(key, elements, window_size):
    """Receives `window_size` elements"""
    return elements.shuffle(window_size, seed=0)

# Create buckets from 0 to 500 with an increment of 15 -> [0, 15, 30, ..., 500]
buckets = [tf.constant(num, dtype=tf.int64) for num in range(0, 500, 15)]
window_size = 1000

# Bucketing
dataset = dataset.group_by_window(
    lambda x, y, z: bucketing_fn(x, buckets),
    lambda key, x: reduc_fn(key, x, window_size), window_size)

# You could pad it in reduc_fn, but I'll do it here for clarity.
# The last component of the dataset is the dynamic sentence. By giving it
# tf.Dimension(None) it will pad the sentences (with 0) to the longest sentence.
dataset = dataset.padded_batch(batch_size, padded_shapes=(
    tf.TensorShape([]), tf.TensorShape([]), tf.Dimension(None)))
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
features = iterator.get_next()
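Because each window only contains sequences from the same length bucket, padded_batch() pads each batch only up to the longest sequence in that bucket, which keeps the amount of padding per batch small.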

Looping over a tensor

I am trying to process a tensor of variable size, in a Python-like way that would be something like:
# X is of shape [m, n]
for x in X:
    process(x)
I have tried to use tf.scan. The thing is that I want to process every sub-tensor, so I tried to use a nested scan, but I was unable to do it, because tf.scan works with an accumulator; if none is provided, it will take the first entry of elems as the initializer, which I don't want to do.
As an example, suppose I want to add one to every element of my tensor (this is just an example), and I want to process it element by element. If I run the code below, I will only have one added to a sub-tensor, because scan considers the first tensor as the initializer, along with the first element of every sub-tensor.
import numpy as np
import tensorflow as tf

batch_x = np.random.randint(0, 10, size=(5, 10))
x = tf.placeholder(tf.float32, shape=[None, 10])

def inner_loop(x_in):
    return tf.scan(lambda _, x_: x_ + 1, x_in)

outer_loop = tf.scan(lambda _, input_: inner_loop(input_), x, back_prop=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    rs = sess.run(outer_loop, feed_dict={x: batch_x})
Any suggestions?
To loop over a tensor you could try tf.unstack
Unpacks the given dimension of a rank-R tensor into rank-(R-1) tensors.
So adding 1 to each tensor would look something like:
import numpy as np
import tensorflow as tf

# tf.unstack needs the size of the unpacked axis to be known statically
x = tf.placeholder(tf.float32, shape=(5, 10))
x_unpacked = tf.unstack(x)  # defaults to axis 0, returns a list of tensors

processed = []  # this will be the list of processed tensors
for t in x_unpacked:
    # do whatever
    result_tensor = t + 1
    processed.append(result_tensor)

output = tf.concat(processed, 0)
with tf.Session() as sess:
    print(sess.run([output], feed_dict={x: np.zeros((5, 10))}))
Obviously you can further unpack each tensor from the list to process it, down to single elements. To avoid lots of nested unpacking though, you could maybe try flattening x with tf.reshape(x, [-1]) first, and then loop over it like
flattened_unpacked = tf.unstack(tf.reshape(x, [-1]))
for elem in flattened_unpacked:
    process(elem)
In this case elem is a scalar.
Most of TensorFlow's built-in functions can be applied elementwise, so you could just pass a tensor into a function, like:
outer_loop = inner_loop(x)
However, if you have some function that cannot be applied this way (it's really tempting to see that function), you could use map_fn.
Say, your function simply adds 1 to every element of a tensor (or whatever):
inputs = tf.placeholder...

def my_elementwise_func(x):
    return x + 1

def recursive_map(inputs):
    # recurse with map_fn until the (statically known) rank reaches 0
    if inputs.shape.ndims > 0:
        return tf.map_fn(recursive_map, inputs)
    else:
        return my_elementwise_func(inputs)

result = recursive_map(inputs)
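A hypothetical usage sketch of the recursive_map above (the [None, 10] placeholder is an assumption; the recursion bottoms out at rank 0 and adds 1 to every element):
import numpy as np
import tensorflow as tf

inputs = tf.placeholder(tf.float32, shape=[None, 10])
result = recursive_map(inputs)

with tf.Session() as sess:
    print(sess.run(result, feed_dict={inputs: np.zeros((3, 10))}))  # prints all ones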
