Most tutorials focus on the case where the entire training dataset fits into memory. However, I have an iterator which acts as an infinite stream of (features, labels)-tuples (creating them cheaply on the fly).
When implementing the input_fn for TensorFlow's Estimator, can I return a fresh batch from the iterator, as in
    def input_fn():
        (feature_batch, label_batch) = next(it)
        return tf.constant(feature_batch), tf.constant(label_batch)
or does input_fn have to return the same (features, labels)-tuples on each call?
Moreover, is this function called multiple times during training, as I hope, like in the following pseudocode?
    for i in range(max_iter):
        learn_op(input_fn())
What input_fn returns is used throughout training, but the function itself is called only once. So creating a sophisticated input_fn that goes beyond returning a constant array, as explained in the tutorial, is not as straightforward.
TensorFlow provides two examples of such non-trivial input_fn, for NumPy and pandas arrays, but they start from an array in memory, so they do not help with your problem.
You could also have a look at their code by following the links above, to see how they implement an efficient non-trivial input_fn, but you may find that it requires more code than you would like.
If you are willing to use the lower-level interface of TensorFlow, things are IMHO simpler and more flexible. There is a tutorial that covers most needs, and the proposed solutions are easy(-er) to implement.
In particular, if you already have an iterator that returns data as you described in your question, using placeholders (section "Feeding" in the previous link) should be straightforward.
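As a rough sketch of that feeding pattern, the loop below pulls a fresh batch from the stream on every step and hands it to a step function. The names sample_stream, batched, train, and run_step are hypothetical; run_step stands in for something like sess.run(train_op, feed_dict={...}) and here it only records what it was fed.

```python
import itertools
import random

def sample_stream(seed=0):
    """Hypothetical infinite stream of (features, label) tuples."""
    rng = random.Random(seed)
    while True:
        features = [rng.random(), rng.random()]
        label = int(sum(features) > 1.0)
        yield features, label

def batched(stream, batch_size):
    """Group an endless stream of samples into (feature_batch, label_batch)."""
    while True:
        chunk = list(itertools.islice(stream, batch_size))
        feature_batch = [f for f, _ in chunk]
        label_batch = [y for _, y in chunk]
        yield feature_batch, label_batch

def train(batches, run_step, max_iter):
    """Feed one fresh batch per step; run_step is a stand-in for the session call."""
    for _ in range(max_iter):
        feature_batch, label_batch = next(batches)
        run_step(feature_batch, label_batch)

# Record what each step receives instead of running a real training op.
fed = []
train(batched(sample_stream(), batch_size=4), lambda x, y: fed.append((x, y)), max_iter=3)
```

Because the Python loop owns the iterator, every step sees new data and nothing has to fit in memory at once.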
I found a pull request which converts a generator to an input_fn:
https://github.com/tensorflow/tensorflow/pull/7045/files
The relevant part is
    def _generator_input_fn():
        """generator input function."""
        queue = feeding_functions.enqueue_data(
            x,
            queue_capacity,
            shuffle=shuffle,
            num_threads=num_threads,
            enqueue_size=batch_size,
            num_epochs=num_epochs)
        features = (queue.dequeue_many(batch_size) if num_epochs is None
                    else queue.dequeue_up_to(batch_size))
        if not isinstance(features, list):
            features = [features]
        features = dict(zip(input_keys, features))
        if target_key is not None:
            if len(target_key) > 1:
                target = {key: features.pop(key) for key in target_key}
            else:
                target = features.pop(target_key[0])
            return features, target
        return features
    return _generator_input_fn
    from tensorflow.contrib.learn.python.learn.learn_io import generator_io
    import numpy as np

    # define generator
    def generator():
        for index in range(2):
            yield {'a': np.ones(1) * index,
                   'b': np.ones(1) * index + 32,
                   'label': np.ones(1) * index - 32}

    input_fn = generator_io.generator_input_fn(generator, target_key='label', batch_size=2, shuffle=False, num_epochs=1)
    features, target = input_fn()
Refer to the test case https://github.com/tensorflow/tensorflow/pull/7045/files
Related
I have a VGG16 model implemented with Keras/tensorflow.
When I call model.fit, I pass in a generator of data. The generator does transforms necessary for a VGGNet:
Preprocess the images with vgg16.preprocess_input
Convert the label to a one-hot vector via to_categorical
The generator can be seen below and works. Unfortunately, since there are multiple epochs, I have to set dataset.repeat(-1) (repeat indefinitely) so the generator doesn't run out. This in turn requires passing steps_per_epoch so that a given iteration of training can complete. As you're probably thinking, this is brittle (it hinges on a known dataset cardinality)!
I have decided it's best to preprocess the training Dataset once up front using Dataset.map. However, I am struggling with the construction of a mapping function: it seems to_categorical doesn't work with a tf.Tensor. Below is what I have right now, but I am not sure if there's a latent bug.
How can I correctly translate the below Dataset generator into a Dataset.map function?
Current Dataset Generator
This is implemented (and known to work) with Python 3.8 and tensorflow==2.4.4.
    from typing import Iterable, Tuple

    import numpy as np
    import tensorflow as tf

    def make_vgg_preprocessing_generator(
        dataset: tf.data.Dataset, num_repeat: int = -1
    ) -> Iterable[Tuple[tf.Tensor, np.ndarray]]:
        num_classes = len(dataset.class_names)
        for batch_images, batch_labels in dataset.repeat(num_repeat):
            pre_images = tf.keras.applications.vgg16.preprocess_input(batch_images)
            pre_labels = tf.keras.utils.to_categorical(batch_labels, num_classes)
            yield pre_images, pre_labels

    train_ds: tf.data.Dataset  # Not provided in this sample
    model.fit(
        make_vgg_preprocessing_generator(train_ds),
        epochs=10,
        steps_per_epoch=10,  # Required since the default num_repeat repeats indefinitely
    )
Dataset.map Function
Here is my current translation that I would like to improve.
    def vgg_preprocess_dataset(dataset: tf.data.Dataset) -> tf.data.Dataset:
        num_classes = len(dataset.class_names)

        def _preprocess(x: tf.Tensor, y: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
            pre_x = tf.keras.applications.vgg16.preprocess_input(x)
            pre_y = tf.one_hot(y, depth=num_classes)
            return pre_x, pre_y

        return dataset.map(_preprocess)
Yes, you're on the right track! You'll want to replace to_categorical with tf.one_hot, just as you have, as tf.one_hot is specifically for tensors, and is designed for this context. Next, you might want to play around with some of the other tf.data.Dataset methods here and add them to your pipeline. Right now, your batch size will be one sample, and un-shuffled. An example of some other processing you might do:
    def vgg_preprocess_dataset(dataset: tf.data.Dataset, batch_size=32, shuffle_buffer=1000) -> tf.data.Dataset:
        num_classes = len(dataset.class_names)

        def _preprocess(x: tf.Tensor, y: tf.Tensor):
            pre_x = tf.keras.applications.vgg16.preprocess_input(x)
            pre_y = tf.one_hot(y, depth=num_classes)
            # pre_y = to_categorical(y, num_classes)
            return pre_x, pre_y

        # bigger buffer is better but slower
        dataset = dataset.shuffle(shuffle_buffer)
        # do your mapping after shuffle
        dataset = dataset.map(_preprocess)
        # next batch it
        dataset = dataset.batch(batch_size)
        # this allows your CPU to fetch the next batch (do the above shuffling, mapping, etc) during the
        # current GPU pass, so that the GPU has minimal downtime
        dataset = dataset.prefetch(2)
        return dataset

    ds = vgg_preprocess_dataset(ds)
    # and you just pass it right to fit!
    model.fit(ds)
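For intuition about what the label mapping produces, here is a plain-Python sketch of one-hot encoding. The helper one_hot_rows is hypothetical and is not how tf.one_hot or to_categorical is implemented; it only shows the kind of output to expect for integer class indices, assuming every label lies in [0, depth).

```python
def one_hot_rows(labels, depth):
    """Return one row per label with 1.0 at the label's index and 0.0 elsewhere."""
    rows = []
    for label in labels:
        if not 0 <= label < depth:
            # Assumption of this sketch: out-of-range labels are treated as an error.
            raise ValueError(f"label {label} is outside [0, {depth})")
        row = [0.0] * depth
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot_rows([0, 2, 1], depth=3)
```

Comparing a few rows like these against one batch from the mapped dataset is a cheap way to check for a latent bug in the label handling.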
I am facing slow training runs, and I have tried to scale up the training procedure by using TensorFlow's Strategy API to utilize all 4 GPUs.
I'm using MirroredStrategy and experimental_distribute_dataset to partition the dataset.
My training data is a mix of sparse and dense matrices. I'm using a generator to construct my dataset (it picks random indices into the data). However, in my current version of TF (2.1), generators don't support sparse matrices. The sparse_matrix does not have a static size and is a Ragged tensor.
This bit is an ugly workaround: I'm passing my sparse_matrix_list directly to the train function and indexing into it through a global queue that the generator populates with the random indices.
This approach was working fine, but it was way too slow, so I wanted to try training with all GPUs. That makes it even more problematic, as I have to manually partition the sparse_matrix_list into num_workers splits.
However, the main problem right now is that the training procedure does not seem to be parallel: the replicas (GPUs) seem to be running sequentially. I verified this through nvidia-smi and logs in the train_process function.
I have no prior experience with distributed training and am not sure why this would be the case. I would really appreciate pointers to a better way of handling this mix of sparse and dense data. I'm currently facing a huge bottleneck in fetching my data, which underutilizes the GPUs (utilization fluctuates between 10% and 30%).
    def distributed_train_step(inputs, sparse_matrix_list):
        per_replica_losses = strategy.experimental_run_v2(train_process, args=(inputs, sparse_matrix_list))
        return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                               axis=None)

    def train_process(inputs, sparse_matrix_list):
        worker_id = tf.distribute.get_replica_context().replica_id_in_sync_group
        replica_batch_size = inputs.shape[0]
        slice_start = replica_batch_size * worker_id
        replica_sparse_matrix = sparse_matrix_list[slice_start:slice_start + replica_batch_size]
        return train_step(inputs, replica_sparse_matrix)

    def train_step(inputs, sparse_matrix_list):
        with tf.GradientTape() as tape:
            outputs, mu, sigma, feat_out, logit = model(inputs)
            loss = K.backend.mean(custom_loss(inputs, sparse_matrix_list))
        return loss

    def get_batch_data(sparse_matrix_list):
        # train_indicies is a global queue holding the random indices into the training
        # data (a list of lists, each entry with len == batch_size)
        next_batch_indicies = train_indicies.get()
        batch_sparse_list = sparse_matrix_list[next_batch_indicies]
        return batch_sparse_list

    dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
    for batch, inputs in enumerate(dist_dataset, 1):
        # sparse_matrix_list is passed to this main "train" function from outside this module.
        batch_sparse_matrix_slice = get_batch_data(sparse_matrix_list)
        loss = distributed_train_step(inputs, batch_sparse_matrix_slice)
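The per-replica slicing in train_process can be checked in isolation, without any GPUs. This is a minimal sketch, assuming the list passed in is ordered the same way as the global batch and that every replica receives the same batch size; replica_slice and the toy data are hypothetical.

```python
def replica_slice(items, replica_batch_size, replica_id):
    """Return the contiguous chunk of the global batch that belongs to one replica."""
    start = replica_batch_size * replica_id
    return items[start:start + replica_batch_size]

# A global batch of 8 items split across 4 replicas, 2 items each.
global_batch = list(range(8))
slices = [replica_slice(global_batch, 2, replica_id) for replica_id in range(4)]
```

If the slices do not tile the global batch exactly, the sparse entries and the dense inputs seen by a replica will not correspond to each other.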
I'm trying to fit my deep learning model with a custom generator.
When I fit the model, it shows me this error:
I tried to find similar questions, but all the answers were about converting lists to NumPy arrays. I don't think that is the issue here: my lists are all already NumPy arrays. This custom generator is based on a custom generator from here
This is the part of code where I fit the model:
    train_generator = RepresentationGenerator(representation_path=representations_path, target_path=target_path,
                                              filenames=training_filenames, batch_size=batch_size)
    val_generator = RepresentationGenerator(representation_path=representations_path, target_path=target_path,
                                            filenames=validation_filenames, batch_size=batch_size)
    self.model_semantic.fit_generator(train_generator,
                                      epochs=10,
                                      verbose=1,
                                      validation_data=val_generator,
                                      )
    return 0
where the variables are:
representations_path - a string with the path to the directory where I store the training files, each of which is an input to the model
target_path - a string with the path to the directory where I store the target files, each of which is a target (output) of the model
training_filenames - a list with the names of the training and target files (both have the same name, but they are in different folders)
batch_size - an integer with the size of the batch. It has the value 7.
My generator class is below:
    import numpy as np
    from tensorflow_core.python.keras.utils.data_utils import Sequence

    class RepresentationGenerator(Sequence):
        def __init__(self, representation_path, target_path, filenames, batch_size):
            self.filenames = np.array(filenames)
            self.batch_size = batch_size
            self.representation_path = representation_path
            self.target_path = target_path

        def __len__(self):
            return (np.ceil(len(self.filenames) / float(self.batch_size))).astype(np.int)

        def __getitem__(self, idx):
            files_to_batch = self.filenames[idx * self.batch_size: (idx + 1) * self.batch_size]
            batch_x, batch_y = [], []
            for file in files_to_batch:
                batch_x.append(np.load(self.representation_path + file + ".npy", allow_pickle=True))
                batch_y.append(np.load(self.target_path + file + ".npy", allow_pickle=True))
            return np.array(batch_x), np.array(batch_y)
These are the values, when the method fit is called:
How can I fix this error?
Thank you mates!
When I call the method fit_generator, it calls the method fit.
The method fit, calls the method func.fit and it passes the variable Y that is set as None
The error occurs in this line:
Final solution:
Import from the correct place:
from tensorflow.keras.utils import Sequence
Old answers:
If __getitem__ is never called, the problem might be in __len__. You're not returning an int, you're returning a np.int.
I suggest you try:
    def __len__(self):
        length = len(self.filenames) // self.batch_size
        if len(self.filenames) % self.batch_size > 0:
            length += 1
        return length
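The same ceiling division can be written as a single integer expression, which also returns a built-in int. A small sketch, where batches_per_epoch is a hypothetical helper and the batch size of 7 is taken from the question:

```python
import math

def batches_per_epoch(num_files, batch_size):
    """Ceiling division that stays a built-in int (no NumPy scalar involved)."""
    return (num_files + batch_size - 1) // batch_size

# Cross-check against math.ceil for a few file counts.
for num_files in (14, 15, 21, 22):
    assert batches_per_epoch(num_files, 7) == math.ceil(num_files / 7)
```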
But if __getitem__ is being called and your data returned, then you should inspect your arrays.
Get an item from the generator yourself and check the content:
x, y = train_generator[0]
Are they single arrays? Or are they arrays of arrays? (Must be single)
What are their shapes? Do they have the expected shapes?
What are their types? Usually they should be float, sometimes int (for inputs to embedding layers), very rarely string (for inputs to custom layers that know how to treat strings).
The outputs must always be float, at most int (for sparse losses)
Another possibility: you're passing batch_size to fit while using a generator. This is unusual, and the "if" clauses inside the method may not be well prepared for it, so you might be falling into a different training case.
Go straight to the usual options:
    self.model_semantic.fit_generator(train_generator,
                                      epochs=10,
                                      verbose=1,
                                      validation_data=val_generator)
Your generator is a Sequence, it already has a __len__, you don't need to specify steps_per_epoch or validation_steps.
Every generator has automatic batch sizes, every step is a batch and that's it. You don't need to specify batch_size in fit_generator.
If you're going to use fit, go like this:
    ...fit(train_generator, steps_per_epoch = len(train_generator),
           epochs = 10, verbose = 1,
           validation_data = val_generator, validation_steps = len(val_generator))
Finally, you should be hunting for anything that might be None (as the error message suggests) in your code.
Check if every function has a return line.
Check all inputs of your generator in __init__.
Print the filenames inside the generator.
Get the __len__ of the generator.
Try to get an item from the generator: x, y= train_generator[0]
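Part of that checklist can be automated. The sketch below walks the first few batches of any object that supports len() and indexing and reports anything that is None. The helper find_batch_problems and the toy BrokenGenerator are hypothetical and do not depend on Keras.

```python
def find_batch_problems(generator, max_batches=3):
    """Return human-readable problems found in the first few batches."""
    problems = []
    if len(generator) == 0:
        problems.append("generator reports zero batches")
    for idx in range(min(len(generator), max_batches)):
        batch = generator[idx]
        if batch is None:
            problems.append(f"batch {idx} is None (missing return?)")
            continue
        if not isinstance(batch, tuple) or len(batch) != 2:
            problems.append(f"batch {idx} is not an (x, y) pair")
            continue
        x, y = batch
        if x is None:
            problems.append(f"batch {idx}: x is None")
        if y is None:
            problems.append(f"batch {idx}: y is None")
    return problems

class BrokenGenerator:
    """Toy stand-in whose __getitem__ forgets to return the targets."""
    def __len__(self):
        return 2

    def __getitem__(self, idx):
        return [idx, idx + 1], None

problems = find_batch_problems(BrokenGenerator())
```

Running it on train_generator and val_generator before calling fit narrows down whether the None comes from the generator or from somewhere inside Keras.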
I'm new to TensorFlow and am trying to understand how to use it outside of a machine learning context. I would like to optimize a Python function with TensorFlow's Adam implementation.
Let's assume I have the following function:
    def fun_test(x):
        """
        :param x: List of parameters, e.g. [1,2,3]
        :return: real value
        """
        res = do_something(x)
        return res
When using SciPy, I would call 'scipy.optimize.minimize(fun_test, x0, method="Nelder-Mead")'. How could I do this with TensorFlow?
Best,
Michael
You need to rewrite the function do_something to take tensors as inputs and return a scalar tensor (i.e. to build a computation graph). Then the following code is a sketch of how to perform optimization on the function. (BTW, in your code fun_test and do_something have no real difference, so I picked the latter.)
    x = tf.get_variable("x", dtype=..., initializer=...)
    target = do_something(x)
    opt = tf.train.AdamOptimizer(...).minimize(target)  # Defines one optimization step
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())  # Initialize x
        NUM_STEPS = 1000
        for _ in range(NUM_STEPS):
            sess.run(opt)  # Run optimization for NUM_STEPS steps
        print(sess.run(x))  # Show values of x
        print(sess.run(target))  # Show target value
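To make it clearer what each optimization step does, here is a plain-Python sketch of the Adam update rule applied to an ordinary Python function. It is not TensorFlow's implementation: it swaps automatic differentiation for a central-difference gradient estimate, and numerical_gradient, adam_minimize, the quadratic fun_test, and the hyperparameter values are all assumptions chosen for illustration.

```python
import math

def numerical_gradient(fun, x, h=1e-6):
    """Central-difference estimate of the gradient of fun at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((fun(up) - fun(down)) / (2 * h))
    return grad

def adam_minimize(fun, x0, num_steps=1000, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run num_steps Adam updates starting from x0 and return the final point."""
    x = list(x0)
    m = [0.0] * len(x)  # running mean of gradients
    v = [0.0] * len(x)  # running mean of squared gradients
    for t in range(1, num_steps + 1):
        g = numerical_gradient(fun, x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

def fun_test(x):
    """Toy objective with its minimum at [3, -1]."""
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2

x_min = adam_minimize(fun_test, [0.0, 0.0])
```

Finite differences need two function evaluations per parameter per step, so this only makes sense for a handful of parameters; rewriting do_something with tensors lets TensorFlow compute exact gradients instead.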
I'm implementing an algorithm involving alternating optimization. That is, at each iteration, the algorithm fetches a data batch and uses it to optimize two losses sequentially. My current implementation with tf.data.Dataset and tf.data.Iterator is something like this (which is incorrect, as detailed below):
    data_batch = iterator.get_next()
    train_op_1 = get_train_op(data_batch)
    train_op_2 = get_train_op(data_batch)
    for _ in range(num_steps):
        sess.run(train_op_1)
        sess.run(train_op_2)
Note that the above is incorrect because each call of sess.run will advance the iterator to get next data batch. So train_op_1 and train_op_2 are indeed using different data batches.
I cannot do something like sess.run([train_op_1, train_op_2]) either, because the two optimization steps need to be sequential (i.e., the 2nd optimization step depends on the latest variable value by the 1st optimization step.)
I'm wondering is there any way to somehow "freeze" the iterator, so that it won't advance in a sess.run call?
I was doing something similar, so this is part of my code with some unnecessary stuff stripped out. It does a bit more, as it has train and validation iterators, but you should get the idea of using the is_keep_previous flag. Basically, passing True forces reuse of the previous value of the iterator, while False fetches a new value.
    iterator_t = ds_t.make_initializable_iterator()
    iterator_v = ds_v.make_initializable_iterator()
    iterator_handle = tf.placeholder(tf.string, shape=[], name="iterator_handle")
    iterator = tf.data.Iterator.from_string_handle(iterator_handle,
                                                   iterator_t.output_types,
                                                   iterator_t.output_shapes)

    def get_next_item():
        # sometimes items need casting
        next_elem = iterator.get_next(name="next_element")
        x, y = tf.cast(next_elem[0], tf.float32), next_elem[1]
        return x, y

    def old_data():
        # just forward the existing batch
        # NOTE: `inputs` and `target` must already exist when tf.cond builds this branch;
        # the code that holds the previous batch is not part of this excerpt
        return inputs, target

    is_keep_previous = tf.placeholder_with_default(tf.constant(False), shape=[], name="keep_previous_flag")
    inputs, target = tf.cond(is_keep_previous, old_data, get_next_item)

    with tf.Session() as sess:
        sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
        handle_t = sess.run(iterator_t.string_handle())
        handle_v = sess.run(iterator_v.string_handle())
        # Run data iterator initialisation
        sess.run(iterator_t.initializer)
        sess.run(iterator_v.initializer)
        while True:
            try:
                inputs_, target_ = sess.run([inputs, target], feed_dict={iterator_handle: handle_t, is_keep_previous: False})
                print(inputs_, target_)
                inputs_, target_ = sess.run([inputs, target], feed_dict={iterator_handle: handle_t, is_keep_previous: True})
                print(inputs_, target_)
                inputs_, target_ = sess.run([inputs, target], feed_dict={iterator_handle: handle_v})
                print(inputs_, target_)
            except tf.errors.OutOfRangeError:
                # now we know we have run out of elements in the validation iterator
                break
Use control dependencies when building the graph for train_op_2 so it can see the updated values of the variables.
Or use eager execution.
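Whichever route is taken, the property to preserve is that the iterator advances exactly once per loop iteration and both steps see that same batch. A plain-Python sketch of that control flow follows; run_alternating, step_1, and step_2 are hypothetical stand-ins for the two training steps and are not TensorFlow calls.

```python
def run_alternating(batches, step_1, step_2, num_steps):
    """Advance the iterator once per iteration and give both steps the same batch."""
    for _ in range(num_steps):
        batch = next(batches)  # the only place the iterator advances
        step_1(batch)          # first optimization step
        step_2(batch)          # second step runs afterwards, on the same data

# Record what each step receives instead of running real training ops.
seen_1, seen_2 = [], []
run_alternating(iter([[1, 2], [3, 4], [5, 6]]), seen_1.append, seen_2.append, num_steps=3)
```

With eager execution the loop looks essentially like this, since the batch is an ordinary value by the time the two steps run.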