Difference between tf.data.Dataset.map() and tf.data.Dataset.apply()

With the recent upgrade to version 1.4, Tensorflow included tf.data in the library core.
One "major new feature" described in the version 1.4 release notes is tf.data.Dataset.apply(), which is a "method for
applying custom transformation functions". How is this different from the already existing tf.data.Dataset.map()?

The difference is that map will execute one function on every element of the Dataset separately, whereas apply will execute one function on the whole Dataset at once (such as group_by_window, which is given as an example in the documentation).
The argument of apply is a function that takes a Dataset and returns a Dataset, whereas the argument of map is a function that takes one element and returns one transformed element.
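As a minimal sketch of the two signatures (the pair_up transformation here is just a placeholder for a real custom transformation):
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# map(): the function receives one element and returns one transformed element.
doubled = dataset.map(lambda x: x * 2)

# apply(): the function receives the whole Dataset and returns a new Dataset.
def pair_up(ds):
    return ds.batch(2)

paired = dataset.apply(pair_up)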

Sunreef's answer is absolutely correct. You might still be wondering why we introduced Dataset.apply(), and I thought I'd offer some background.
The tf.data API has a set of core transformations—like Dataset.map() and Dataset.filter()—that are generally useful across a wide range of datasets, unlikely to change, and implemented as methods on the tf.data.Dataset object. In particular, they are subject to the same backwards compatibility guarantees as other core APIs in TensorFlow.
However, the core approach is a bit restrictive. We also want the freedom to experiment with new transformations before adding them to the core, and to allow other library developers to create their own reusable transformations. Therefore, in TensorFlow 1.4 we split out a set of custom transformations that live in tf.contrib.data. The custom transformations include some that have very specific functionality (like tf.contrib.data.sloppy_interleave()), and some where the API is still in flux (like tf.contrib.data.group_by_window()). Originally we implemented these custom transformations as functions from Dataset to Dataset, which had an unfortunate effect on the syntactic flow of a pipeline. For example:
dataset = tf.data.TFRecordDataset(...).map(...)
# Method chaining breaks when we apply a custom transformation.
dataset = custom_transformation(dataset, x, y, z)
dataset = dataset.shuffle(...).repeat(...).batch(...)
Since this seemed to be a common pattern, we added Dataset.apply() as a way to chain core and custom transformations in a single pipeline:
dataset = (tf.data.TFRecordDataset(...)
           .map(...)
           .apply(custom_transformation(x, y, z))
           .shuffle(...)
           .repeat(...)
           .batch(...))
It's a minor feature in the grand scheme of things, but hopefully it helps to make tf.data programs easier to read, and the library easier to extend.

I don't have enough reputation to comment, but I just wanted to point out that you can actually use map to apply a function to multiple elements of a dataset, contrary to @Sunreef's comments on his own post.
According to the documentation, map takes as an argument
map_func: A function mapping a nested structure of tensors (having
shapes and types defined by self.output_shapes and self.output_types)
to another nested structure of tensors.
The output_shapes are defined by the dataset and can be modified by API functions such as batch. So, for example, you can do a batch normalization using only dataset.batch and .map with:
dataset = ...
dataset = dataset.batch(batch_size)
dataset = dataset.map(normalize_fn)
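One possible normalize_fn, purely as an illustration (the zero-mean/unit-variance standardization per batch is an assumption, not the only choice):
import numpy as np
import tensorflow as tf

def normalize_fn(batch):
    # Standardize the whole batch to zero mean and unit variance.
    mean, variance = tf.nn.moments(batch, axes=[0])
    return (batch - mean) / tf.sqrt(variance + 1e-8)

data = np.random.rand(100, 4).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices(data).batch(16).map(normalize_fn)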
It seems like the primary utility of apply() is when you really want to do a transformation across the entire dataset.

Simply put: the argument transformation_func of apply() is a Dataset; the argument map_func of map() is an element.

Related

Is there an alternative to tf.py_function() for custom Python code?

I have started using TensorFlow 2.0 and have a little uncertainty with regard to one aspect.
Suppose I have this use case: while ingesting data with tf.data.Dataset, I want to apply some specific augmentation operations to some images. However, the external libraries that I am using require that the image be a NumPy array, not a tensor.
When using tf.data.Dataset.from_tensor_slices(), the data flowing through needs to be of type Tensor. Concrete example:
def my_function(tensor_image):
    print(tensor_image.numpy())
    return

data = tf.data.Dataset.from_tensor_slices(tensor_images).map(my_function)
The code above does not work, yielding a 'Tensor' object has no attribute 'numpy' error.
I have read the documentation on TensorFlow 2.0, which states that if one wants to use arbitrary Python logic, one should use tf.py_function or only TensorFlow primitives, according to:
How to convert "tensor" to "numpy" array in tensorflow?
My question is the following: is there another way to use arbitrary Python code in a function, e.g. with a custom decorator, that is easier than tf.py_function?
Honestly, it seems to me that there must be a more elegant way than passing everything through tf.py_function, converting to a NumPy array, performing operations A, B, C, D, and then converting back to a tensor and yielding the result.
There is no other way of doing it, because tf.data.Dataset pipelines are still (and, I suppose, always will be, for performance reasons) executed in graph mode; thus, you cannot use anything outside of the tf.* methods that TensorFlow can easily convert to its graph representation.
Using tf.py_function is the only way to mix Python execution (and thus any Python library) with graph execution when using a tf.data.Dataset object (in contrast to the rest of TensorFlow 2.0, which, being eager by default, allows this mixed execution naturally).
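For reference, a minimal sketch of that tf.py_function pattern (augment_numpy stands in for the external NumPy-only augmentation library from the question):
import numpy as np
import tensorflow as tf

def augment_numpy(image):
    # Stand-in for the external, NumPy-only augmentation library.
    return np.flip(image, axis=1)

def augment(image):
    # tf.py_function executes its func eagerly, so .numpy() works inside it.
    augmented = tf.py_function(
        func=lambda x: augment_numpy(x.numpy()), inp=[image], Tout=tf.float32)
    augmented.set_shape(image.shape)  # py_function drops static shape info
    return augmented

images = np.random.rand(8, 32, 32, 3).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices(images).map(augment)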

How to use TFF api's for custom usage?

I have read and studied the TFF guide and API pages carefully, but I am confused about some of the details.
For example, when I want to wrap/decorate a TF/Python function, I use these two APIs:
1. tff.tf_computation()
2. tff.federated_computation()
I cannot find what the differences between them are and when I am allowed to use each of them, especially in case I want to use algorithms other than FedAvg or FedSGD. I wonder if you know:
How could they be used to manipulate inputs? Do they work on @CLIENTS or @SERVER?
How could I use them for something other than the output of tff.federated_mean or tff.federated_sum, whose value ends up on the server?
How am I able to access the details of the data and metrics on @CLIENTS and @SERVER?
Why should we invoke tff.tf_computation() from tff.federated_computation()? In this link, there was not any explanation about them.
Do these APIs (e.g. tff.federated_mean or tff.federated_sum) modify the output elements of each client and bring them to the @SERVER?
Could anyone help me to understand intuitive behind the concept?
A possible rule of thumb about the different function decorators:
tff.tf_computation is for wrapping TF logic. Think "tensors in, tensors out": this should be very similar to the usage of tf.function, where the parameters and return values are tensors, or nested structures of tensors. TFF intrinsics (e.g. tff.federated_mean) cannot be used inside a tff.tf_computation, and tff.tf_computations cannot call tff.federated_computations. The type signature is always unplaced.
tff.federated_computation should be used to wrap TFF programming abstractions. Think "tensors here, tensors there": Inside this context, a tff.tf_computation can be applied to tff.Values and tff.Values can be communicated to other placements using the intrinsics. The type signature can accept federated types (i.e. types with placements).
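A minimal sketch of this rule of thumb, along the lines of the TFF custom algorithms tutorials (the add_half names are illustrative, and the exact API may vary by TFF version):
import tensorflow as tf
import tensorflow_federated as tff

@tff.tf_computation(tf.float32)
def add_half(x):
    # Plain TF logic: tensors in, tensors out; no TFF intrinsics allowed here.
    return tf.add(x, 0.5)

@tff.federated_computation(tff.FederatedType(tf.float32, tff.CLIENTS))
def add_half_on_clients(values):
    # Federated logic: apply the TF computation pointwise on the clients.
    return tff.federated_map(add_half, values)

print(add_half_on_clients([1.0, 3.0, 2.0]))  # runs add_half on each "client" value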
For your list of questions:
Both can work on values placed at CLIENTS or SERVER. For example, a tff.tf_computation called my_comp can be applied to a value v of type int32@CLIENTS with tff.federated_map(my_comp, v), which will run my_comp on each client.
tff.federated_map() supports applying a computation pointwise (across clients) to data that is not on the server. You can manipulate the metrics on each client using tff.federated_map. TFF isn't intended for running different operations on different clients; the abstractions do not support addressing individual clients. You may be able to simulate this in Python, see Operations performed on the communications between the server and clients.
The values of placed data can be inspected in simulation simply by returning them from a tff.Computation, and invoking that computation. The values should be available in the Python environment.
tff.tf_computations should be invokable from anywhere, if there is documentation that says otherwise please point to it. I believe what was intended to highlight is that a tff.federated_computation may invoke a tff.tf_computation, but not vice versa.
The tutorials (Federated Learning for Image Classification and Federated Learning for Text Generation) show examples of printing out the metrics in simulation. You may also be interested in the answer to how to print local outputs in tensorflow federated?
tff.tf_computations should be executed directly if desired. This will avoid any of the federated part of TFF, and simply delegate to TensorFlow. To apply the computation to federated values and use in combination with federated intrinsics, they must be called inside a tff.federated_computation.

TensorFlow: Is it possible to map a function to a dataset using a for-loop?

I have a tf.data.TFRecordDataset and a (computationally expensive) function, which I want to map to it. I use TensorFlow 1.12 and eager execution, and the function uses NumPy ndarray interpretations of the tensors in my dataset via EagerTensor.numpy(). However, code inside a function that is given to tf.data.Dataset.map() is not executed eagerly, which is why the .numpy() conversion doesn't work there and .map() is no longer an option. Is it possible to for-loop through a dataset and modify the examples in it? Simply assigning to them doesn't seem to work.
No, not exactly.
A Dataset is inherently lazily evaluated and cannot be assigned to in that way - conceptually, try to think of it as a pipeline rather than a variable: each value is read, passed through any map() operations, batch() ops, etc., and surfaced to the model as needed. "Assigning" a value would mean writing it to disk in the .tfrecord file, and that is unlikely ever to be supported (these files are specifically designed to be read quickly in sequence, not accessed randomly).
You could, instead, use TensorFlow to do your pre-processing and use a TFRecordWriter to write a NEW tfrecord with the expensive pre-processing completed, then use this new dataset as the input to your model. If you have the disk space available, this might well be your best option.
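A rough sketch of that approach under TF 1.12 eager execution (the feature name "values", the float encoding, and the file paths are assumptions about the record format):
import numpy as np
import tensorflow as tf

tf.enable_eager_execution()  # TF 1.12

def expensive_numpy_preprocess(array):
    # Stand-in for the expensive NumPy-based function from the question.
    return array * 2.0

# Iterate the existing records eagerly, preprocess in NumPy, and write a new file.
dataset = tf.data.TFRecordDataset("input.tfrecord")
with tf.python_io.TFRecordWriter("preprocessed.tfrecord") as writer:
    for raw_record in dataset:
        example = tf.train.Example.FromString(raw_record.numpy())
        values = np.array(example.features.feature["values"].float_list.value)
        processed = expensive_numpy_preprocess(values)
        new_example = tf.train.Example(features=tf.train.Features(feature={
            "values": tf.train.Feature(
                float_list=tf.train.FloatList(value=processed.tolist()))
        }))
        writer.write(new_example.SerializeToString())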

Opportunistic caching with reusable custom graphs in Dask

Dask supports defining custom computational graphs as well as opportunistic caching. The question is how they can be used together.
For instance, let's define a very simple computational graph that computes the x+1 operation:
import dask

def compute(x):
    graph = {'step1': (sum, [x, 1])}
    return dask.get(graph, 'step1')

print('Cache disabled:', compute(1), compute(2))
this yields 2 and 3 as expected.
Now we enable opportunistic caching,
from dask.cache import Cache
cc = Cache(1e9)
cc.register()
print('Cache enabled: ', compute(1), compute(2))
print(cc.cache.data)
we incorrectly get a result of 2 in both cases, because cc.cache.data is {'step1': 2} irrespective of the input.
I imagine this means that the input needs to be hashed (e.g. with dask.base.tokenize) and appended to all the keys in the graph. Is there a simpler way of doing it, particularly since the tokenize function is not part of the public API?
The issue is that in complex graphs, a given step name needs to account for the hash of all the inputs provided to its child steps, which means that it's necessary to do full graph resolution.
It's important that key names in dask graphs are unique (as you found above). Additionally, we'd like identical computations to have the same key so we can avoid computing them multiple times - this isn't necessary for dask to work though, it just provides some opportunities for optimization.
In dask's internals we make use of dask.base.tokenize to compute a "hash" of the inputs, resulting in deterministic key names. You are free to make use of this function as well. In the issue you linked above we say the function is public, just that the implementation might change (not the signature).
Also note that for many use cases, we recommend using dask.delayed now instead of custom graphs for generating custom computations. This will do the deterministic hashing for you behind the scenes.
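A sketch of both options applied to the example above (the 'step1-' key prefix is arbitrary):
import dask
from dask.base import tokenize

def compute(x):
    # Derive the key from a hash of the inputs, so identical calls share a
    # cache entry and different inputs no longer collide.
    key = 'step1-' + tokenize(x)
    graph = {key: (sum, [x, 1])}
    return dask.get(graph, key)

print(compute(1), compute(2))  # now 2 and 3, even with the cache registered

# Alternatively, dask.delayed builds the graph and the deterministic keys for you:
print(dask.delayed(sum)([2, 1]).compute())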

Flink batch data processing

I'm evaluating Flink for processing some batches of data. As a simple example, say I have 2000 points which I would like to pass through an FIR filter using functionality provided by scipy. The scipy filter is a simple function which accepts a set of coefficients and the data to filter, and returns the filtered data. Is it possible to create a transformation to handle this in Flink? It seems Flink transformations are applied on a point-by-point basis, but I may be missing something.
This should certainly be possible. Flink already has a Python API (beta) you might want to use.
About your second question: Flink can apply a function point by point and can do other things, too. It depends on what kind of function you are defining. For example, filter, project, map, flatMap are applied per record; max, min, reduce, etc. are applied to a group of records (the groups are defined via groupBy). There is also the possibility to join data from different datasets using join, cross, or coGroup. Please have a look at the list of available transformations in the documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/dataset_transformations.html
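For illustration, the kind of group-level function the question describes might look like this on the Python side (just a sketch; wiring it into a Flink group-wise transformation is omitted):
import numpy as np
from scipy import signal

def fir_filter(coefficients, data):
    # Apply an FIR filter to the whole group of points at once
    # (a = [1.0] makes lfilter a pure FIR filter).
    return signal.lfilter(coefficients, [1.0], data)

coefficients = signal.firwin(numtaps=29, cutoff=0.2)  # example low-pass design
filtered = fir_filter(coefficients, np.random.randn(2000))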
