How can I transform my dataset (composed of images) into a federated dataset?
I am trying to create something similar to emnist but for my own dataset.
tff.simulation.datasets.emnist.load_data(
only_digits=True, cache_dir=None )
You will need to create the ClientData object first, for example:
client_data = tff.simulation.datasets.ClientData.from_clients_and_tf_fn(client_ids,
    create_dataset)
where create_dataset is a serializable function. Before that, you have to prepare your images (read this tutorial about preprocessing data):
labels_tf = tf.convert_to_tensor(labels)
def parse_image(filename):
    parts = tf.strings.split(filename, os.sep)
    label_str = parts[-2]
    label_int = tf.where(labels_tf == label_str)[0][0]
    image = tf.io.read_file(filename)
    image = tf.io.decode_jpeg(image, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [32, 32])
    return image, label_int
Once your data is prepared, use it in the create_dataset function:
def create_dataset(client_id):
    ....
    list_ds = tf.data.Dataset.list_files(<path of your dataset>)
    images_ds = list_ds.map(parse_image)
    return images_ds
After this step, you can write a preprocessing function:
NUM_CLIENTS = 10
NUM_EPOCHS = 5
BATCH_SIZE = 20
SHUFFLE_BUFFER = 100
PREFETCH_BUFFER = 10
def preprocess(dataset):
    return dataset.repeat(NUM_EPOCHS).shuffle(SHUFFLE_BUFFER, seed=1).batch(
        BATCH_SIZE).prefetch(PREFETCH_BUFFER)
After this you can build a list of tf.data.Dataset objects, one per client, which is suitable for federated training.
def make_federated_data(client_data, client_ids):
    return [
        preprocess(client_data.create_tf_dataset_for_client(x))
        for x in client_ids
    ]
After this your dataset is ready for federated learning!
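As a rough sanity check on the constants above, the number of batches each client yields can be worked out by hand. This is only a sketch: `batches_per_client` is an illustrative helper (not part of TFF), and the client with 50 images is a made-up example.

```python
import math

NUM_EPOCHS = 5
BATCH_SIZE = 20


def batches_per_client(num_examples, num_epochs=NUM_EPOCHS, batch_size=BATCH_SIZE):
    """Count batches produced by repeat(num_epochs) followed by batch(batch_size).

    repeat() runs before batch(), so the epochs are concatenated first and
    only the very last batch can be smaller than batch_size.
    """
    total_examples = num_examples * num_epochs
    return math.ceil(total_examples / batch_size)


# Hypothetical client with 50 images: 250 examples, i.e. 12 full batches
# plus one final batch of 10.
n_batches = batches_per_client(50)
print(n_batches)
```

If every batch must have the same size, `batch(BATCH_SIZE, drop_remainder=True)` drops that final partial batch instead.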
Federated datasets in TFF are represented as ClientData objects. There are multiple subclasses that can be used depending on your dataset.
Two potentially relevant ways to create such objects:
Use ClientData.from_clients_and_tf_fn. This is useful for smaller datasets.
As a SqlClientData, which uses a SQL-file backing to improve performance. This can be done through tff.simulation.datasets.save_to_sql_client_data. Effectively, this allows you to do one-time work to create the client datasets and save the result, rather than having to reconstruct the datasets each time.
Note that both of these require TF-serializable functions for creating datasets from ids. If you just have tensors, you can use TestClientData, but this is intended only for small-scale datasets.
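Whichever option you pick, you first need to decide which files belong to which client. The answers above do not say how the images are assigned to clients, so the sketch below simply assumes a directory layout of <root>/<client_id>/<label>/<file>; both the layout and the helper name `group_files_by_client` are hypothetical.

```python
import os
from collections import defaultdict


def group_files_by_client(paths):
    """Map each client id to its list of (path, label) pairs.

    Assumes every path ends in <client_id>/<label>/<file>, so the client id
    is the third-to-last path component and the label the second-to-last.
    """
    by_client = defaultdict(list)
    for path in paths:
        parts = path.split(os.sep)
        client_id, label = parts[-3], parts[-2]
        by_client[client_id].append((path, label))
    return dict(by_client)


# Made-up example paths following the assumed layout.
paths = [
    os.path.join("data", "client_0", "cat", "a.jpg"),
    os.path.join("data", "client_0", "dog", "b.jpg"),
    os.path.join("data", "client_1", "cat", "c.jpg"),
]
files_by_client = group_files_by_client(paths)
client_ids = sorted(files_by_client)
print(client_ids)
```

The resulting `client_ids` list is what you would pass as the first argument when building the ClientData object; the function that creates a dataset from an id then only has to look up that client's files.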
Related
I'm trying to load data for optimizing a model for object detection + instance segmentation. However, tf.data.Dataset is giving me a bit of a headache when loading instance segmentation masks. tf.data.Dataset is using all the memory on the server (more than 128 GB) with a small dataset.
Is there a way to load the data in a more memory-efficient way? Right now we are using this code:
train_dataset, train_examples = dataset.load_train_datasets()
ds = (
train_dataset.shuffle(min(100, train_examples), reshuffle_each_iteration=True)
.map(dataset.decode, num_parallel_calls=args.num_parallel_calls)
.map(train_processing.prepare_for_batch, num_parallel_calls=args.num_parallel_calls)
.batch(args.batch_size)
.map(train_processing.preprocess_batch, num_parallel_calls=args.num_parallel_calls)
.prefetch(AUTOTUNE)
)
The problem is that the second map call, with train_processing.prepare_for_batch (takes a single element), and the third, with train_processing.preprocess_batch (takes a batch of elements), create a lot of binary masks for segmentation, which use all the memory.
Is there a way to reorganize the mapping functions to save memory? I was thinking of something like: 1. take the first 100 samples, 2. decode the samples, 3. prepare the masks and bounding boxes for one sample, 4. take a batch of them, 5. do the final preparation of the data per batch, 6. fit one step (one batch of data), 7. clear the data from memory.
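The order of steps listed above is essentially a lazy, streaming pipeline. As a plain-Python illustration of that idea (not of tf.data's internals), the sketch below chains generators so that only one batch of prepared samples exists at a time. `decode`, `prepare` and `batched` are hypothetical stand-ins for the real decoding and mask-building functions.

```python
import itertools


def decode(sample_id):
    # Stand-in for decoding one record into pixels.
    return {"id": sample_id, "pixels": [0] * 4}


def prepare(sample):
    # Stand-in for building masks and boxes for a single sample.
    prepared = dict(sample)
    prepared["mask"] = [1] * 4
    return prepared


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items, pulling items only on demand."""
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def pipeline(sample_ids, batch_size):
    decoded = (decode(s) for s in sample_ids)   # nothing is decoded yet
    prepared = (prepare(s) for s in decoded)    # nothing is prepared yet
    return batched(prepared, batch_size)


# Each batch is built only when the loop asks for it and can be freed
# as soon as the training step that consumes it has finished.
sizes = [len(batch) for batch in pipeline(range(10), batch_size=4)]
print(sizes)
```

In a tf.data pipeline, the knobs that decide how many elements are alive at once are mainly the shuffle buffer size, num_parallel_calls, and the prefetch buffer, so those are the first things to reduce when memory runs out.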
Manually
First make a list of all the filenames in the dataset and a list of all the labels in the dataset.
filenames = ["abc.png", "def.png", ...]
labels = [0, 1, ...]
Then create dataset from tensor slices
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(len(filenames))
dataset = dataset.map(PARSE_FUNCTION, num_parallel_calls=PARALLEL_CALLS)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(1)
Through a function
def dataset(csv, parse):
    filenames = []
    labels = []
    for i, row in csv.iterrows():
        filename = row[0]
        filenames.append(filename)
        label = row[1]
        labels.append(label)
    encoder = LabelEncoder()
    labels = encoder.fit_transform(labels)
    labels = np_utils.to_categorical(labels)
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    dataset = dataset.shuffle(len(filenames))
    # use the parse function passed in as an argument
    dataset = dataset.map(parse, num_parallel_calls=PARALLEL_CALLS)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.prefetch(1)
    return dataset
Disclaimer: this method assumes csv is in (filenames, label) format
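For readers unfamiliar with the label step in the function above: LabelEncoder maps each distinct label to an integer id (ids follow the sorted order of the unique labels), and to_categorical turns those ids into one-hot rows. A dependency-free sketch of the same transformation, with an illustrative helper name and made-up labels:

```python
def encode_labels(labels):
    """Return (classes, integer ids, one-hot rows) for a list of labels."""
    classes = sorted(set(labels))                  # sorted unique labels
    index = {name: i for i, name in enumerate(classes)}
    ids = [index[name] for name in labels]
    one_hot = [
        [1.0 if column == label_id else 0.0 for column in range(len(classes))]
        for label_id in ids
    ]
    return classes, ids, one_hot


classes, ids, one_hot = encode_labels(["dog", "cat", "dog", "bird"])
print(classes)
print(ids)
print(one_hot[0])
```

One-hot labels pair with a categorical_crossentropy loss; if you keep the integer ids instead, use sparse_categorical_crossentropy.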
I am trying to reimplement some parts of Nvidia's noise2noise repo to learn TensorFlow and the tf.data pipeline, and I am having a lot of trouble understanding what is happening. So far I am able to create a TFRecord consisting of tf.train.Example types as described in https://github.com/NVlabs/noise2noise/blob/master/dataset_tool_tf.py
image = load_image(imgname)
feature = {
    'shape': shape_feature(image.shape),
    'data': bytes_feature(tf.compat.as_bytes(image.tostring()))
}
example = tf.train.Example(features=tf.train.Features(feature=feature))
writer.write(example.SerializeToString())
That part makes sense. What's driving me nuts is the noise augmentation piece in https://github.com/NVlabs/noise2noise/blob/master/dataset.py Specifically the function:
def create_dataset(train_tfrecords, minibatch_size, add_noise):
    print ('Setting up dataset source from', train_tfrecords)
    buffer_mb = 256
    num_threads = 2
    dset = tf.data.TFRecordDataset(train_tfrecords, compression_type='', buffer_size=buffer_mb<<20)
    dset = dset.repeat()
    buf_size = 1000
    dset = dset.prefetch(buf_size)
    dset = dset.map(parse_tfrecord_tf, num_parallel_calls=num_threads)
    dset = dset.shuffle(buffer_size=buf_size)
    dset = dset.map(lambda x: random_crop_noised_clean(x, add_noise))
    dset = dset.batch(minibatch_size)
    it = dset.make_one_shot_iterator()
    return it
This function returns an iterator. The iterator is used in train.py and yields three elements at every iteration:
noisy_input, noisy_target, clean_target = dataset_iter.get_next()
I've tried reimplementing this in a local tensorflow jupyter notebook, and I can't figure out where those three items are coming from. The way I understood it, the create_dataset(...) function just takes every input image in the Example record, and augments it with gaussian/poisson noise. But then why is the returned iterator pointing to three different images? What's the connection between the augmentation in create_dataset(...) and the three different images in the iterator?
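One possible reading, assuming that random_crop_noised_clean returns a tuple of three tensors (two independently noised copies of the crop plus the clean crop; this is worth confirming in dataset.py): map replaces every element with whatever the mapped function returns, so each dataset element becomes a 3-tuple, and batch then stacks each of the three components separately. A plain-Python analogy, with lists standing in for tensors and all helper names made up for illustration:

```python
import random


def add_noise(values, rng):
    # Stand-in for the Gaussian/Poisson noise augmentation.
    return [v + rng.gauss(0.0, 0.1) for v in values]


def noised_pair_and_clean(clean, rng):
    # One input element in, a tuple of three outputs back.
    return add_noise(clean, rng), add_noise(clean, rng), clean


rng = random.Random(0)
images = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6]]

# "map": every element becomes a 3-tuple.
mapped = [noised_pair_and_clean(image, rng) for image in images]

# "batch": each of the three components is stacked on its own.
noisy_input, noisy_target, clean_target = (
    list(component) for component in zip(*mapped)
)
print(len(noisy_input), len(noisy_target), len(clean_target))
```

Under that assumption the three values are not three different images: they are two differently noised versions of the same crop plus the clean crop itself.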
I found this, which was really helpful in understanding map, batch, and shuffle: What does batch, repeat, and shuffle do with TensorFlow Dataset?
I want to train a convolutional neural network (using tf.keras from Tensorflow version 1.13) using numpy arrays as input data. The training data (which I currently store in a single >30GB '.npz' file) does not fit in RAM all at once. What is the best way to save and load large data-sets into a neural network for training? Since I didn't manage to find a good answer to this (surely ubiquitous?) problem, I'm hoping to hear one here. Thank you very much in advance for any help!
Sources
Similar questions seem to have been asked many times (e.g. training-classifier-from-tfrecords-in-tensorflow, tensorflow-synchronize-readings-from-tfrecord, how-to-load-data-parallelly-in-tensorflow) but are several years old and usually contain no conclusive answer.
My current understanding is that using TFRecord files is a good way to approach this problem. The most promising tutorial I found so far explaining how to use TFRecord files with keras is medium.com. Other helpful sources were machinelearninguru.com and medium.com_source2 and sources therein.
The official tensorflow documentation and tutorials (on tf.data.Dataset, Importing Data, tf_records etc.) did not help me. In particular, several of the examples given there didn't work for me even without modifications.
My Attempt at using TFRecord files
I'm assuming TFRecords are a good way to solve my problem but I'm having a hard time using them. Here is an example I made based on the tutorial medium.com. I stripped down the code as much as I could.
# python 3.6, tensorflow 1.13.
# Adapted from https://medium.com/@moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36
import tensorflow as tf
import numpy as np
from tensorflow.python import keras as keras
# Helper functions (see also https://www.tensorflow.org/tutorials/load_data/tf_records)
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def writeTFRecords():
    number_of_samples = 100  # create some random data to play with
    images, labels = (np.random.sample((number_of_samples, 256, 256, 1)), np.random.randint(0, 30, number_of_samples))
    writer = tf.python_io.TFRecordWriter("bla.tfrecord")
    for index in range(images.shape[0]):
        image = images[index]
        label = labels[index]
        feature = {'image': _bytes_feature(tf.compat.as_bytes(image.tostring())),
                   'label': _int64_feature(int(label))}
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())
    writer.close()
def loadTFRecord(data_path):
    with tf.Session() as sess:
        feature = {'train/image': tf.FixedLenFeature([], tf.string),
                   'train/label': tf.FixedLenFeature([], tf.int64)}
        # Create a list of filenames and pass it to a queue
        filename_queue = tf.train.string_input_producer([data_path], num_epochs=1)
        # Define a reader and read the next record
        reader = tf.TFRecordReader()
        _, serialized_example = reader.read(filename_queue)
        # Decode the record read by the reader
        features = tf.parse_single_example(serialized_example, features=feature)
        # Convert the image data from string back to the numbers
        image = tf.decode_raw(features['train/image'], tf.float32)
        # Cast label data into int32
        label = tf.cast(features['train/label'], tf.int32)
        # Reshape image data into the original shape
        image = tf.reshape(image, [256, 256, 1])
        return image, label  # I'm not 100% sure that's how this works...
# ######### generate a TFRecords file in the working directory containing random data. #################################
writeTFRecords()
# ######## Load the TFRecords file and use it to train a simple example neural network. ################################
image, label = loadTFRecord("bla.tfrecord")
model_input = keras.layers.Input(tensor=image)
model_output = keras.layers.Flatten(input_shape=(-1, 256, 256, 1))(model_input)
model_output = keras.layers.Dense(16, activation='relu')(model_output)
train_model = keras.models.Model(inputs=model_input, outputs=model_output)
train_model.compile(optimizer=keras.optimizers.RMSprop(lr=0.0001),
loss='mean_squared_error',
target_tensors=[label])
print("\n \n start training \n \n") # Execution gets stuck on fitting
train_model.fit(epochs=1, steps_per_epoch=10) # no output or error messages.
The code creates a TFRecord file and starts fitting, then just gets stuck with no output or error messages. I don't know what the problem is or how I could try to fix it.
While this is no real answer to the original question (i.e. "what is the optimal way to train on large datasets"), I managed to get tfrecords and datasets to work. Of particular help was this tutorial on YouTube. I include a minimal example with working code for anyone struggling with the same problem.
# Developed using python 3.6, tensorflow 1.14.0.
# This code writes data (pairs (label, image) where label is int64 and image is np.ndarray) into .tfrecord files and
# uses them for training a simple neural network. It is meant as a minimal working example of how to use tfrecords. This
# solution is likely not optimal. If you know how to improve it, please comment on
# https://stackoverflow.com/q/57717004/9988487. Refer to links therein for further information.
import tensorflow as tf
import numpy as np
from tensorflow.python import keras as keras
# Helper functions (see also https://www.tensorflow.org/tutorials/load_data/tf_records)
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def write_tfrecords_file(out_path: str, images: np.ndarray, labels: np.ndarray) -> None:
    """Write all image-label pairs into a single .tfrecord file.
    :param out_path: File path of the .tfrecord file to generate or overwrite.
    :param images: array with first dimension being the image index. Every images[i].tostring() is
        serialized and written into the file as 'image': _bytes_feature(img_bytes)
    :param labels: 1d array of integers. labels[i] is the label of images[i]. Written as 'label': _int64_feature(label)"""
    assert len(images) == len(labels)
    with tf.io.TFRecordWriter(out_path) as writer:  # could use the options parameter to enable compression
        for i in range(len(labels)):
            img_bytes = images[i].tostring()  # Convert the image to raw bytes.
            label = labels[i]
            data = {'image': _bytes_feature(img_bytes), 'label': _int64_feature(label)}
            feature = tf.train.Features(feature=data)  # Wrap the data as TensorFlow Features.
            example = tf.train.Example(features=feature)  # Wrap again as a TensorFlow Example.
            serialized = example.SerializeToString()  # Serialize the data.
            writer.write(serialized)  # Write the serialized data to the TFRecords file.
def parse_example(serialized, shape=(256, 256, 1)):
    features = {'image': tf.io.FixedLenFeature([], tf.string), 'label': tf.io.FixedLenFeature([], tf.int64)}
    # Parse the serialized data so we get a dict with our data.
    parsed_example = tf.io.parse_single_example(serialized=serialized, features=features)
    label = parsed_example['label']
    image_raw = parsed_example['image']  # Get the image as raw bytes.
    image = tf.decode_raw(image_raw, tf.float32)  # Decode the raw bytes so it becomes a tensor with type.
    image = tf.reshape(image, shape=shape)
    return image, label  # this function will be called once (to add it to tf graph; then parse images individually)
# create some arbitrary data to play with: 10,000 images sized 256x256 with one colour channel. Use your custom np-arrays
IMAGE_WIDTH, NUM_OF_IMAGES, NUM_OF_CLASSES, COLOUR_CHANNELS = 256, 10_000, 10, 1
# using float32 to save memory. Must match type in parse_example(), tf.decode_raw(image_raw, tf.float32)
features_train = np.random.sample((NUM_OF_IMAGES, IMAGE_WIDTH, IMAGE_WIDTH, COLOUR_CHANNELS)).astype(np.float32)
labels_train = np.random.randint(low=0, high=NUM_OF_CLASSES, size=NUM_OF_IMAGES) # one random label for each image
features_eval = features_train[:200] # use the first 200 images as evaluation data for simplicity.
labels_eval = labels_train[:200]
write_tfrecords_file("train.tfrecord", features_train, labels_train) # normal: split the data files of several GB each
write_tfrecords_file("eval.tfrecord", features_eval, labels_eval) # this may take a while. Consider a progressbar
# The files are complete. Now define a model and use datasets to feed the data from the .tfrecord files into the model.
model = keras.Sequential([keras.layers.Flatten(input_shape=(256, 256, 1)),
keras.layers.Dense(128, activation='relu'),
keras.layers.Dense(10, activation='softmax')])
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
# Check docs for parameters (compression, buffer size, thread count. Also www.tensorflow.org/guide/performance/datasets
train_dataset = tf.data.TFRecordDataset("train.tfrecord") # specify a list (or dataset) of file names for large data
train_dataset = train_dataset.map(parse_example) # parse tfrecords. Parameter num_parallel_calls may help performance.
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(64)
validation_dataset = tf.data.TFRecordDataset("eval.tfrecord")
validation_dataset = validation_dataset.map(parse_example).batch(64)
model.fit(train_dataset, epochs=3)
# evaluate the results
results = model.evaluate(validation_dataset)
print('\n\nvalidation loss, validation acc:', results)
Note that it's tricky to use some_keras_model.fit(..., validation_data=some_dataset) with dataset objects. It may result in
TypeError: 'DatasetV1Adapter' object does not support indexing.
This seems to be a bug (see github.com/tensorflow/tensorflow/issues/28995) and is supposedly fixed as of tf-nightly version '1.15.0-dev20190808'; The official tutorial uses this too, although it doesn't work in most versions. An easy but dirty-ish fix is to use verbose=0 (which only suppresses program output) and plot the validation results using tensorboard. Also see Keras model.fit() with tf.dataset API + validation_data.
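A side note on the comment in the code above about the dtype having to match tf.decode_raw: raw bytes carry no type information, so decoding with a different dtype than the one used for writing silently produces a different number of (meaningless) values. np.random.sample returns float64, which is why the .astype(np.float32) matters. The sketch below shows the effect with the standard-library array module standing in for NumPy and TensorFlow; the pixel values are made up.

```python
from array import array

pixels = [0.0, 0.25, 0.5, 1.0]

# Serialize as float64 ("d"), 8 bytes per value.
raw = array("d", pixels).tobytes()

# Decoding with the matching type recovers the data.
as_float64 = array("d")
as_float64.frombytes(raw)

# Decoding the same bytes as float32 ("f", 4 bytes per value) yields twice as
# many values, none of which are the original pixels in the original order.
as_float32 = array("f")
as_float32.frombytes(raw)

print(len(as_float64), len(as_float32))
```

In the TensorFlow version the same mismatch typically surfaces later as a reshape error, because the decoded tensor has the wrong number of elements for the expected image shape.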
I'm following this guide.
It shows how to download datasets from the new TensorFlow Datasets using tfds.load() method:
import tensorflow_datasets as tfds
SPLIT_WEIGHTS = (8, 1, 1)
splits = tfds.Split.TRAIN.subsplit(weighted=SPLIT_WEIGHTS)
(raw_train, raw_validation, raw_test), metadata = tfds.load(
'cats_vs_dogs', split=list(splits),
with_info=True, as_supervised=True)
The next step shows how to apply a function to each item in the dataset using the map method:
def format_example(image, label):
    image = tf.cast(image, tf.float32)
    image = image / 255.0
    # Resize the image if required
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    return image, label
train = raw_train.map(format_example)
validation = raw_validation.map(format_example)
test = raw_test.map(format_example)
Then to access the elements we can use:
for features in ds_train.take(1):
    image, label = features["image"], features["label"]
OR
for example in tfds.as_numpy(train_ds):
    numpy_images, numpy_labels = example["image"], example["label"]
However, the guide doesn't mention anything about data augmentation. I want to use real time data augmentation similar to that of Keras's ImageDataGenerator Class. I tried using:
if np.random.rand() > 0.5:
    image = tf.image.flip_left_right(image)
and other similar augmentation functions in format_example(), but how can I verify that it's performing real-time augmentation and not replacing the original image in the dataset?
I could convert the complete dataset to a NumPy array by passing batch_size=-1 to tfds.load() and then using tfds.as_numpy(), but that would load all the images into memory, which is not needed. I should be able to use train = train.prefetch(tf.data.experimental.AUTOTUNE) to load just enough data for the next training loop.
You are approaching the problem from a wrong direction.
First, download data using tfds.load, cifar10 for example (for simplicity we will use default TRAIN and TEST splits):
import tensorflow_datasets as tfds
dataloader = tfds.load("cifar10", as_supervised=True)
train, test = dataloader["train"], dataloader["test"]
(you can use custom tfds.Split objects to create validation datasets or others, see the documentation)
train and test are tf.data.Dataset objects so you can use map, apply, batch and similar functions to each of those.
Below is an example, where I will (using tf.image mostly):
convert each image to tf.float32 in the 0-1 range (don't use this stupid snippet from official docs, this way ensures correct image format)
cache() results as those can be re-used after each repeat
randomly flip left_to_right each image
randomly change contrast of image
shuffle data and batch
IMPORTANT: repeat all the steps when dataset is exhausted. This means that after one epoch all of the above transformations are applied again (except for the ones which were cached).
Here is the code doing the above (you can change lambdas to functors or functions):
train = train.map(
    lambda image, label: (tf.image.convert_image_dtype(image, tf.float32), label)
).cache().map(
    lambda image, label: (tf.image.random_flip_left_right(image), label)
).map(
    lambda image, label: (tf.image.random_contrast(image, lower=0.0, upper=1.0), label)
).shuffle(
    100
).batch(
    64
).repeat()
Such tf.data.Dataset can be passed directly to Keras's fit, evaluate and predict methods.
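A note on why the pipeline uses the random ops from tf.image rather than np.random.rand() inside the mapped function: Dataset.map traces the Python function into a graph once, so any plain Python or NumPy randomness is evaluated a single time and the same decision is then reused for every element, whereas TensorFlow's random ops are re-evaluated per element. The sketch below is only a pure-Python analogy of that difference (no TensorFlow involved); the factory names are made up.

```python
import random


def make_flip_decided_at_build_time(rng):
    # Analogy for np.random.rand() in a traced function:
    # the coin is tossed once, when the pipeline is built.
    flip = rng.random() > 0.5

    def augment(image):
        return image[::-1] if flip else image

    return augment


def make_flip_decided_per_element(rng):
    # Analogy for tf.image.random_flip_left_right:
    # the coin is tossed again for every element.
    def augment(image):
        return image[::-1] if rng.random() > 0.5 else image

    return augment


image = [1, 2, 3]
fixed = make_flip_decided_at_build_time(random.Random(0))
per_element = make_flip_decided_per_element(random.Random(0))

fixed_outputs = {tuple(fixed(image)) for _ in range(200)}
varying_outputs = {tuple(per_element(image)) for _ in range(200)}
print(len(fixed_outputs), len(varying_outputs))
```

The first variant only ever produces one version of the image, no matter how many times the data is repeated; the second produces both the original and the flipped one.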
Verifying it actually works like that
I see you are highly suspicious of my explanation, let's go through an example:
1. Get small subset of data
Here is one way to take a single element, admittedly unreadable and unintuitive, but you should be fine with it if you do anything with Tensorflow:
# Horrible API is horrible
element = tfds.load(
    # Take one percent of test and take 1 element from it
    "cifar10",
    as_supervised=True,
    split=tfds.Split.TEST.subsplit(tfds.percent[:1]),
).take(1)
2. Repeat data and check whether it is the same:
Using Tensorflow 2.0 one can actually do it without stupid workarounds (almost):
element = element.repeat(2)
# You can iterate through tf.data.Dataset now, finally...
images = [image[0] for image in element]
print(f"Are the same: {tf.reduce_all(tf.equal(images[0], images[1]))}")
And it unsurprisingly returns:
Are the same: True
3. Check whether data differs after each repeat with random augmentation
Below snippet repeats single element 5 times and checks which are equal and which are different.
element = (
    tfds.load(
        # Take one percent of test and take 1 element
        "cifar10",
        as_supervised=True,
        split=tfds.Split.TEST.subsplit(tfds.percent[:1]),
    )
    .take(1)
    .map(lambda image, label: (tf.image.random_flip_left_right(image), label))
    .repeat(5)
)
images = [image[0] for image in element]
for i in range(len(images)):
    for j in range(i, len(images)):
        print(
            f"{i} same as {j}: {tf.reduce_all(tf.equal(images[i], images[j]))}"
        )
Output (in my case; each run will be different):
0 same as 0: True
0 same as 1: False
0 same as 2: True
0 same as 3: False
0 same as 4: False
1 same as 1: True
1 same as 2: False
1 same as 3: True
1 same as 4: True
2 same as 2: True
2 same as 3: False
2 same as 4: False
3 same as 3: True
3 same as 4: True
4 same as 4: True
You could cast each of those images to numpy as well and see the images for yourself using skimage.io.imshow, matplotlib.pyplot.imshow or other alternatives.
Another example of visualization of real-time data augmentation
This answer provides a more comprehensive and readable view on data augmentation using Tensorboard and MNIST, might want to check that one out (yeah, shameless plug, but useful I guess).
I am using tf.keras to build my network, and I am doing all the augmentation at the tensor level since my data is in tfrecords files. I need shearing and ZCA for augmentation but couldn't find a proper implementation in TensorFlow. I can't use ImageDataGenerator, which does both operations I need, because as I said my data doesn't fit in memory and it is in tfrecord format. So all my augmentation has to be tensor-wise.
@fchollet here suggested a way to use ImageDataGenerator with a large dataset.
My first question is:
If I use @fchollet's way, which is basically taking an X-sized sample of the large data to run ImageDataGenerator and then using train_on_batch to train the network, how can I feed my validation data to the network?
My second question: is there any tensor-wise implementation of the shear and ZCA operations? Some people, like here, suggested using tf.contrib.image.transform, but I couldn't understand how. If someone has an idea of how to do it, I would appreciate it.
Update:
This is my trial to construct the transformation matrix through ski_image
from skimage import io
from skimage import transform as trans
import tensorflow as tf
def augment(image):
    afine_tf = trans.AffineTransform(shear=0.2)
    transform = tf.contrib.image.matrices_to_flat_transforms(tf.linalg.inv(afine_tf.params))
    transform = tf.cast(transform, tf.float32)
    image = tf.contrib.image.transform(image, transform)  # image here is a tensor
    return image
dataset_train = tf.data.TFRecordDataset(training_files, num_parallel_reads=calls)
dataset_train = dataset_train.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=1000+ 4 * batch_size))
dataset_train = dataset_train.map(decode_train, num_parallel_calls= calls)
dataset_train = dataset_train.map(augment,num_parallel_calls=calls )
dataset_train = dataset_train.batch(batch_size)
dataset_train = dataset_train.prefetch(tf.contrib.data.AUTOTUNE)
I will answer the second question.
Today a user commented on one of my old questions, but the comments were deleted while I was adding more details on how to use tf.contrib.image.transform. I guess it was you, right?
So, I have edited my question and added an example, check it here.
TL;DR:
def transformImg(imgIn, forward_transform):
    t = tf.contrib.image.matrices_to_flat_transforms(tf.linalg.inv(forward_transform))
    # please notice that forward_transform must be a float matrix,
    # e.g. [[2.0,0,0],[0,1.0,0],[0,0,1]] will work
    # but [[2,0,0],[0,1,0],[0,0,1]] will not
    imgOut = tf.contrib.image.transform(imgIn, t, interpolation="BILINEAR", name=None)
    return imgOut
def shear_transform_example(filename, shear_lambda):
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_jpeg(image_string, channels=3)
    img = transformImg(image_decoded, [[1.0, shear_lambda, 0], [0, 1.0, 0], [0, 0, 1.0]])
    # Notice that this is a shear transformation parallel to the x axis
    # If you want a y axis version, use this:
    # img = transformImg(image_decoded, [[1.0,0,0],[shear_lambda,1.0,0],[0,0,1.0]])
    return img
img = shear_transform_example("white_square.jpg",0.1)
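To see what the matrix in shear_transform_example does, and why transformImg inverts it, it may help to apply it to a single coordinate by hand. tf.contrib.image.transform expects a transform that maps output coordinates back to input coordinates, so the forward matrix has to be inverted first. The sketch below uses plain Python with illustrative helper names; for an x-axis shear the inverse is simply the shear with the opposite sign.

```python
import math


def apply_affine(matrix, point):
    """Apply a 3x3 affine matrix (last row 0, 0, 1) to an (x, y) point."""
    x, y = point
    (a, b, c), (d, e, f), _ = matrix
    return (a * x + b * y + c, d * x + e * y + f)


def shear_x(shear_lambda):
    """Shear parallel to the x axis, as in shear_transform_example."""
    return [[1.0, shear_lambda, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


forward = shear_x(0.1)
inverse = shear_x(-0.1)  # inverse of an x-shear: same matrix with -lambda

# A point 10 pixels down is pushed 1 pixel to the right by the forward shear.
moved = apply_affine(forward, (0.0, 10.0))
# The inverse maps it back, which is the direction the transform op needs.
restored = apply_affine(inverse, moved)
print(moved, restored)
```

The further a pixel is from the row y = 0, the more it is shifted horizontally, which produces the slanted look of a sheared image.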