Understanding tf.data pipeline in NVLabs/noise2noise

Understanding tf.data pipeline in NVLabs/noise2noise - python

I am trying reimplement some parts of Nvidia's noise2noise repo to learn tensorflow and the tf.data pipeline, and I am having a lot of trouble understanding what is happening. So far I am able to create a TFRecord consisting of tf.train.Example types as described in https://github.com/NVlabs/noise2noise/blob/master/dataset_tool_tf.py
image = load_image(imgname)
feature = {
'shape': shape_feature(image.shape),
'data': bytes_feature(tf.compat.as_bytes(image.tostring()))
}
example = tf.train.Example(features=tf.train.Features(feature=feature))
writer.write(example.SerializeToString())
That part makes sense. What's driving me nuts is the noise augmentation piece in https://github.com/NVlabs/noise2noise/blob/master/dataset.py Specifically the function:
def create_dataset(train_tfrecords, minibatch_size, add_noise):
print ('Setting up dataset source from', train_tfrecords)
buffer_mb = 256
num_threads = 2
dset = tf.data.TFRecordDataset(train_tfrecords, compression_type='', buffer_size=buffer_mb<<20)
dset = dset.repeat()
buf_size = 1000
dset = dset.prefetch(buf_size)
dset = dset.map(parse_tfrecord_tf, num_parallel_calls=num_threads)
dset = dset.shuffle(buffer_size=buf_size)
dset = dset.map(lambda x: random_crop_noised_clean(x, add_noise))
dset = dset.batch(minibatch_size)
it = dset.make_one_shot_iterator()
return it
returns an iterator. This iterator is used in train.py and has three elements that are returned at every iteration:
noisy_input, noisy_target, clean_target = dataset_iter.get_next()
I've tried reimplementing this in a local tensorflow jupyter notebook, and I can't figure out where those three items are coming from. The way I understood it, the create_dataset(...) function just takes every input image in the Example record, and augments it with gaussian/poisson noise. But then why is the returned iterator pointing to three different images? What's the connection between the augmentation in create_dataset(...) and the three different images in the iterator?
I found this, which was really helpful in understanding map, batch, and shuffle: What does batch, repeat, and shuffle do with TensorFlow Dataset?

Related

How can I prepare my image dataset for a federated model?

How could I transform my dataset (composed of images) in a federated dataset?
I am trying to create something similar to emnist but for my own dataset.
tff.simulation.datasets.emnist.load_data(
only_digits=True, cache_dir=None )

You will need to create the clientData object first
for example:
client_data = tff.simulation.datasets.ClientData.from_clients_and_tf_fn(client_ids,
create_dataset)
where create_dataset is a serializable function but first you have to prepare your images read this tutorial about preprocessing data
labels_tf = tf.convert_to_tensor(labels)
def parse_image(filename):
parts = tf.strings.split(filename, os.sep)
label_str = parts[-2]
label_int = tf.where(labels_tf == label_str)[0][0]
image = tf.io.read_file(filename)
image = tf.io.decode_jpeg(image,channels=3)
image = tf.image.convert_image_dtype(image, tf.float32)
image = tf.image.resize(image, [32, 32])
return image, label_int
When you prepared your data pass it to the create_dataset function
def create_dataset(client_id):
....
list_ds = tf.data.Dataset.list_files(<path of your dataset>)
images_ds = list_ds.map(parse_image)
return images_ds
after this step, you can make some preprocessing function
NUM_CLIENTS = 10
NUM_EPOCHS = 5
BATCH_SIZE = 20
SHUFFLE_BUFFER = 100
PREFETCH_BUFFER = 10
def preprocess(dataset):
return dataset.repeat(NUM_EPOCHS).shuffle(SHUFFLE_BUFFER, seed=1).batch(
BATCH_SIZE).prefetch(PREFETCH_BUFFER)
After this you could make a tf.data.Dataset which will be suitable for federated training.
def make_federated_data(client_data, client_ids):
return [
preprocess(client_data.create_tf_dataset_for_client(x))
for x in client_ids
]
After this your dataset is ready for federated learning!

Federated datasets in TFF are represented as ClientData objects. There are multiple subclasses that can be used depending on your dataset.
Two potentially relevant ways to create such objects:
Use ClientData.from_clients_and_tf_fn. This is useful for smaller datasets.
As a SqlClientData, which uses a SQL-file backing to improve performance. This can be done through tff.simulation.datasets.save_to_sql_client_data. Effectively, this allows you to do one-time work to create the client datasets and save the result, rather than having to reconstruct the datasets each time.
Note that both of these require TF-serializable functions for creating datasets from ids. If you just tensors you can use TestClientData, but this is intended only for small-scale datasets.

How to get Data Generator more efficient?

To train a neural network, I modified a code I found on YouTube. It looks as follows:
def data_generator(samples, batch_size, shuffle_data = True, resize=224):
num_samples = len(samples)
while True:
random.shuffle(samples)
for offset in range(0, num_samples, batch_size):
batch_samples = samples[offset: offset + batch_size]
X_train = []
y_train = []
for batch_sample in batch_samples:
img_name = batch_sample[0]
label = batch_sample[1]
img = cv2.imread(os.path.join(root_dir, img_name))
#img, label = preprocessing(img, label, new_height=224, new_width=224, num_classes=37)
img = preprocessing(img, new_height=224, new_width=224)
label = my_onehot_encoded(label)
X_train.append(img)
y_train.append(label)
X_train = np.array(X_train)
y_train = np.array(y_train)
yield X_train, y_train
Now, I tried to train a neural network using this code, train sample size is 105.000 (image files which contain 8 characters out of 37 possibilities, A-Z, 0-9 and blank space).
I used a relatively small batch size (32, I think that is already too small) to get it more efficient but nevertheless it took like forever to train one quarter of the first epoch (I had 826 steps per epoch, and it took 90 minutes for 199 steps... steps_per_epoch = num_train_samples // batch_size).
The following functions are included in the data generator:
def shuffle_data(data):
data=random.shuffle(data)
return data
I don't think we can make this function anyhow more efficient or exclude it from the generator.
def preprocessing(img, new_height, new_width):
img = cv2.resize(img,(new_height, new_width))
img = img/255
return img
For preprocessing/resizing the data I use this code to get the images to a unique size of e.g. (224, 224, 3). I think, this part of the generator takes the most time, but I don't see a possibility to exclude it from the generator (since my memory would be full, if we resize the images outside the batches).
#One Hot Encoding of the Labels
from numpy import argmax
# define input string
def my_onehot_encoded(label):
# define universe of possible input values
characters = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ '
# define a mapping of chars to integers
char_to_int = dict((c, i) for i, c in enumerate(characters))
int_to_char = dict((i, c) for i, c in enumerate(characters))
# integer encode input data
integer_encoded = [char_to_int[char] for char in label]
# one hot encode
onehot_encoded = list()
for value in integer_encoded:
character = [0 for _ in range(len(characters))]
character[value] = 1
onehot_encoded.append(character)
return onehot_encoded
I think, in this part there could be one approach to make it more efficient. I am thinking about to exclude this code from the generator and produce the array y_train outside of the generator, so that the generator does not have to one hot encode the labels every time.
What do you think? Or should I maybe go for a completely different approach?

I have found your question very intriguing because you give only clues. So here is my investigation.
Using your snippets, I have found GitHub repository and 3 part video tutorial on YouTube that mainly focuses on the benefits of using generator functions in Python.
The data is based on this kaggle (I would recommend to check out different kernels on that problem to compare the approach that you already tried with another CNN networks and review API in use).
You do not need to write a data generator from scratch, though it is not hard, but inventing the wheel is not productive.
Keras has the ImageDataGenerator class.
Plus here is a more generic example for DataGenerator.
Tensorflow offers very neat pipelines with their tf.data.Dataset.
Nevertheless, to solve the kaggle's task, the model needs to perceive single images only, hence the model is a simple deep CNN. But as I understand, you are combining 8 random characters (classes) into one image to recognize multiple classes at once. For that task, you need R-CNN or YOLO as your model. I just recently opened for myself YOLO v4, and it is possible to make it work for specific task really quick.
General advice about your design and code.
Make sure the library uses GPU. It saves a lot of time. (Even though I repeated flowers experiment from the repository very fast on CPU - about 10 minutes, but resulting predictions are no better than a random guess. So full training requires a lot of time on CPU.)
Compare different versions to find a bottleneck. Try a dataset with 48 images (1 per class), increase the number of images per class, and compare. Reduce image size, change the model structure, etc.
Test brand new models on small, artificial data to prove the idea or use iterative process, start from projects that can be converted to your task (handwriting recognition?).

Tensorflow: Modern way to load large data

I want to train a convolutional neural network (using tf.keras from Tensorflow version 1.13) using numpy arrays as input data. The training data (which I currently store in a single >30GB '.npz' file) does not fit in RAM all at once. What is the best way to save and load large data-sets into a neural network for training? Since I didn't manage to find a good answer to this (surely ubiquitous?) problem, I'm hoping to hear one here. Thank you very much in advance for any help!
Sources
Similar questions seem to have been asked many times (e.g. training-classifier-from-tfrecords-in-tensorflow, tensorflow-synchronize-readings-from-tfrecord, how-to-load-data-parallelly-in-tensorflow) but are several years old and usually contain no conclusive answer.
My current understanding is that using TFRecord files is a good way to approach this problem. The most promising tutorial I found so far explaining how to use TFRecord files with keras is medium.com. Other helpful sources were machinelearninguru.com and medium.com_source2 and sources therin.
The official tensorflow documentation and tutorials (on tf.data.Dataset, Importing Data, tf_records etc.) did not help me. In particular, several of the examples given there didn't work for me even without modifications.
My Attempt at using TFRecord files
I'm assuming TFRecords are a good way to solve my problem but I'm having a hard time using them. Here is an example I made based on the tutorial medium.com. I stripped down the code as much as I could.
# python 3.6, tensorflow 1.13.
# Adapted from https://medium.com/#moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36
import tensorflow as tf
import numpy as np
from tensorflow.python import keras as keras
# Helper functions (see also https://www.tensorflow.org/tutorials/load_data/tf_records)
def _int64_feature(value):
return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def _bytes_feature(value):
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def writeTFRecords():
number_of_samples = 100 # create some random data to play with
images, labels = (np.random.sample((number_of_samples, 256, 256, 1)), np.random.randint(0, 30, number_of_samples))
writer = tf.python_io.TFRecordWriter("bla.tfrecord")
for index in range(images.shape[0]):
image = images[index]
label = labels[index]
feature = {'image': _bytes_feature(tf.compat.as_bytes(image.tostring())),
'label': _int64_feature(int(label))}
example = tf.train.Example(features=tf.train.Features(feature=feature))
writer.write(example.SerializeToString())
writer.close()
def loadTFRecord(data_path):
with tf.Session() as sess:
feature = {'train/image': tf.FixedLenFeature([], tf.string),
'train/label': tf.FixedLenFeature([], tf.int64)}
# Create a list of filenames and pass it to a queue
filename_queue = tf.train.string_input_producer([data_path], num_epochs=1)
# Define a reader and read the next record
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
# Decode the record read by the reader
features = tf.parse_single_example(serialized_example, features=feature)
# Convert the image data from string back to the numbers
image = tf.decode_raw(features['train/image'], tf.float32)
# Cast label data into int32
label = tf.cast(features['train/label'], tf.int32)
# Reshape image data into the original shape
image = tf.reshape(image, [256, 256, 1])
return image, label # I'm not 100% sure that's how this works...
# ######### generate a TFRecords file in the working directory containing random data. #################################
writeTFRecords()
# ######## Load the TFRecords file and use it to train a simple example neural network. ################################
image, label = loadTFRecord("bla.tfrecord")
model_input = keras.layers.Input(tensor=image)
model_output = keras.layers.Flatten(input_shape=(-1, 256, 256, 1))(model_input)
model_output = keras.layers.Dense(16, activation='relu')(model_output)
train_model = keras.models.Model(inputs=model_input, outputs=model_output)
train_model.compile(optimizer=keras.optimizers.RMSprop(lr=0.0001),
loss='mean_squared_error',
target_tensors=[label])
print("\n \n start training \n \n") # Execution gets stuck on fitting
train_model.fit(epochs=1, steps_per_epoch=10) # no output or error messages.
The code creates a TFRecord file and starts fitting, then just gets stuck with no output or error messages. I don't know what the problem is or how I could try to fix it.

While this is no real answer to the original question (i.e. "what is the optimal way to train on large datasets"), I managed to get tfrecords and datasets to work. Of particular help was this tutorial on YouTube. I include a minimal example with working code for anyone struggling with the same problem.
# Developed using python 3.6, tensorflow 1.14.0.
# This code writes data (pairs (label, image) where label is int64 and image is np.ndarray) into .tfrecord files and
# uses them for training a simple neural network. It is meant as a minimal working example of how to use tfrecords. This
# solution is likely not optimal. If you know how to improve it, please comment on
# https://stackoverflow.com/q/57717004/9988487. Refer to links therein for further information.
import tensorflow as tf
import numpy as np
from tensorflow.python import keras as keras
# Helper functions (see also https://www.tensorflow.org/tutorials/load_data/tf_records)
def _int64_feature(value):
return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def _bytes_feature(value):
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def write_tfrecords_file(out_path: str, images: np.ndarray, labels: np.ndarray) -> None:
"""Write all image-label pairs into a single .tfrecord file.
:param out_path: File path of the .tfrecord file to generate or overwrite.
:param images: array with first dimension being the image index. Every images[i].tostring() is
serialized and written into the file as 'image': wrap_bytes(img_bytes)
:param labels: 1d array of integers. labels[i] is the label of images[i]. Written as 'label': wrap_int64(label)"""
assert len(images) == len(labels)
with tf.io.TFRecordWriter(out_path) as writer: # could use writer_options parameter to enable compression
for i in range(len(labels)):
img_bytes = images[i].tostring() # Convert the image to raw bytes.
label = labels[i]
data = {'image': _bytes_feature(img_bytes), 'label': _int64_feature(label)}
feature = tf.train.Features(feature=data) # Wrap the data as TensorFlow Features.
example = tf.train.Example(features=feature) # Wrap again as a TensorFlow Example.
serialized = example.SerializeToString() # Serialize the data.
writer.write(serialized) # Write the serialized data to the TFRecords file.
def parse_example(serialized, shape=(256, 256, 1)):
features = {'image': tf.io.FixedLenFeature([], tf.string), 'label': tf.io.FixedLenFeature([], tf.int64)}
# Parse the serialized data so we get a dict with our data.
parsed_example = tf.io.parse_single_example(serialized=serialized, features=features)
label = parsed_example['label']
image_raw = parsed_example['image'] # Get the image as raw bytes.
image = tf.decode_raw(image_raw, tf.float32) # Decode the raw bytes so it becomes a tensor with type.
image = tf.reshape(image, shape=shape)
return image, label # this function will be called once (to add it to tf graph; then parse images individually)
# create some arbitrary data to play with: 1000 images sized 256x256 with one colour channel. Use your custom np-arrays
IMAGE_WIDTH, NUM_OF_IMAGES, NUM_OF_CLASSES, COLOUR_CHANNELS = 256, 10_000, 10, 1
# using float32 to save memory. Must match type in parse_example(), tf.decode_raw(image_raw, tf.float32)
features_train = np.random.sample((NUM_OF_IMAGES, IMAGE_WIDTH, IMAGE_WIDTH, COLOUR_CHANNELS)).astype(np.float32)
labels_train = np.random.randint(low=0, high=NUM_OF_CLASSES, size=NUM_OF_IMAGES) # one random label for each image
features_eval = features_train[:200] # use the first 200 images as evaluation data for simplicity.
labels_eval = labels_train[:200]
write_tfrecords_file("train.tfrecord", features_train, labels_train) # normal: split the data files of several GB each
write_tfrecords_file("eval.tfrecord", features_eval, labels_eval) # this may take a while. Consider a progressbar
# The files are complete. Now define a model and use datasets to feed the data from the .tfrecord files into the model.
model = keras.Sequential([keras.layers.Flatten(input_shape=(256, 256, 1)),
keras.layers.Dense(128, activation='relu'),
keras.layers.Dense(10, activation='softmax')])
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
# Check docs for parameters (compression, buffer size, thread count. Also www.tensorflow.org/guide/performance/datasets
train_dataset = tf.data.TFRecordDataset("train.tfrecord") # specify a list (or dataset) of file names for large data
train_dataset = train_dataset.map(parse_example) # parse tfrecords. Parameter num_parallel_calls may help performance.
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(64)
validation_dataset = tf.data.TFRecordDataset("eval.tfrecord")
validation_dataset = validation_dataset.map(parse_example).batch(64)
model.fit(train_dataset, epochs=3)
# evaluate the results
results = model.evaluate(validation_dataset)
print('\n\nvalidation loss, validation acc:', results)
Note that it's tricky to use some_keras_model.fit(..., validation_data=some_dataset) with dataset objects. It may result in
TypeError: 'DatasetV1Adapter' object does not support indexing.
This seems to be a bug (see github.com/tensorflow/tensorflow/issues/28995) and is supposedly fixed as of tf-nightly version '1.15.0-dev20190808'; The official tutorial uses this too, although it doesn't work in most versions. An easy but dirty-ish fix is to use verbose=0 (which only suppresses program output) and plot the validation results using tensorboard. Also see Keras model.fit() with tf.dataset API + validation_data.

How to use dataset.map in Tensorflow Data API correctly

I'm learning how to use the Tensorflow Data API, and struggle to understand how mapping works. For the context, I want to load a dataset of images and send them to a neural network.
The MWE below does that in idea (fake dataset of size 10, read_image function mapped to the dataset).
import tensorflow as tf
import numpy as np
def read_image(filename, label):
return np.random.rand(8, 8, 1), label # simulate data load (generate random data)
# generate fake dataset of filenames (of size 10)
filenames = tf.constant(np.asarray(["file" + str(i) for i in range(10)]))
labels = tf.constant(np.asarray([2*i for i in range(10)]))
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(read_image)
dataset = dataset.repeat().batch(2)
iterator = tf.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
X, y = iterator.get_next()
train_init_op = iterator.make_initializer(dataset)
with tf.Session() as session:
tf.global_variables_initializer().run()
session.run(train_init_op)
for _ in range(10):
print(session.run([X]))
When running this code (that is supposed to do nothing, just print the values generated by read_image), it ends up with always the same data: read_image is called only once. Why is that ? I used dataset.map, isn't it supposed to be called on every element of my dataset (10 here) ?
Thanks in advance for any help or advice.

My understanding of it was wrong, I actually figured out how to make it work. This map function is "integrated" to the Tensorflow graph, hence indeed called just once. One need to use TF operations in it.
If read_image is replaced by:
def read_image(filename, label):
return tf.random_normal([4]), tf.random_normal([1])
It works as expected (random values are generated each time session.run is called).

How to get two tf.dataset from tf.data.Dataset.zip((images, labels))

I am working on the Python/tensorflow/mnist tutorial.
Since a few weeks using the orignal code from tensorflow web site i get the warning that the image dataset would soon be deprecated abd that i should use the following one :
https://github.com/tensorflow/models/blob/master/official/mnist/dataset.py
I load it it my code using :
from tensorflow.models.official.mnist import dataset
trainfile = dataset.train(data_dir)
Which returns :
tf.data.Dataset.zip((images, labels))
The issue is that I cannot find a,way to separate them in the following way for example :
trainfile = dataset.train(data_dir)
train_data= trainfile.images
train_label= trainfile.label
But this clearly doesnot work because the attributrs images and label do not exist. trainfile is a tf.dataset.
Knowing that tf.dataset is made of int32 and float32 i tried :
train_data = trainfile.map(lambda x,y : x.dtype == tf.float32)
But it returns and empty dataset.
I insist (but will be open mimded) in doing it this way (two complete batches of image and label) because this is how the tutorial works :
https://www.tensorflow.org/tutorials/estimators/cnn
I saw a lot of solution to get elements from datasets but nothing to go back from the zip operations that is done in the following code
tf.data.Dataset.zip((images, labels))
Thanks you in advance for your help.

I hope this helps:
inputs = tf.placeholder(tf.float32, shape=(None, 784), name='inputs')
outputs = tf.placeholder(tf.float32, shape=(None,), name='outputs')
#Prepare a tensorflow dataset
ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
ds = ds.shuffle(buffer_size=10, reshuffle_each_iteration=True).batch(batch_size=batch_size, drop_remainder=True).repeat()
iter = ds.make_one_shot_iterator()
next = iter.get_next()
inputs = next[0]
outputs = next[1]

TensorFlow's get_single_element() is finally around which can be used to unzip features and labels from the dataset.
This avoids the need of generating and using an iterator using .map(), iter() or one_shot_iterator() (which could be costly for big datasets).
get_single_element() returns a tensor (or a tuple or dict of tensors) encapsulating all the members of the dataset. We need to pass all the members of the dataset batched into a single element.
This can be used to get features as a tensor-array, or features and labels as a tuple or dictionary (of tensor-arrays) depending upon how the original dataset was created.
Check this answer on SO for an example that unpacks features and labels into a tuple of tensor-arrays.

Instead of separating into two datasets, one for images and another for labels, it's best to make a single iterator which returns both the image and the label.
The reason why this is preferred is that it's a lot easier to ensure that you match each example with its label even after a complicated series of shuffles, reorderings, filterings, etc, as you might have in a nontrivial input pipeline.

You can visualize images and find its associated labels
ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
ds = ds.shuffle(buffer_size=10).batch(batch_size=batch_size)
iter = ds.make_one_shot_iterator()
next = iter.get_next()
def display(image, label):
# display image
...
plt.imshow(image)
...
with tf.Session() as sess:
try:
while True:
image, label = sess.run(next)
# image = numpy array (batch, image_size)
# label = numpy array (batch, label)
display(image[0], label[0]) #display first image in batch
except:
pass

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.