What is "tracing" with regard to tf.function?

The word "tracing" is mentioned frequently in TensorFlow's guide like Better performance with tf.function
What is "tracing" exactly, does it refer to generating the graph as a result of
calling the tf.function for the first time (and subsequently
depending on the arguments)?
What happens when only part of the computation is annotated with
#tf.function, will it mix eager execution with graph execution?

Yes, "tracing" means to run a Python function and "record" its TensorFlow operations in a graph. Note the traced code may not exactly correspond to the written Python code, if Autograph has performed some transformation. Tracing is ideally only done once, the first time the function is called, so subsequent calls can directly use the traced graph and save the Python code execution. As you say, though, future calls may require retracing the function depending on the given arguments, as explained in the link you posted.
You can call a @tf.function from a function that works in eager mode, in which case, yes, it will sort of "mix" both modes. But if you call an unannotated function from a @tf.function, then its code will also be traced - that is, you cannot temporarily go back to eager/Python mode from within a @tf.function. That is the reason why, at some point, there was the suggestion that you only needed to annotate higher-level functions, because the lower-level ones would be "graphed" too anyway - although it is not so clear-cut when one should or should not annotate a function; see Should I use @tf.function for all functions? and this GitHub discussion.
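For instance, a sketch of how an unannotated helper gets traced along with its annotated caller (illustrative names):

```python
import tensorflow as tf

def inner(x):
    # Not annotated, but traced when called from a tf.function
    print("Tracing inner")  # runs only during tracing, not when the graph executes
    return x * 2

@tf.function
def outer(x):
    return inner(x) + 1

outer(tf.constant(1))  # prints "Tracing inner" once, while the graph is built
outer(tf.constant(2))  # no print: the traced graph (inner included) is reused
```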
EDIT: When I say "you cannot temporarily go back to eager/Python mode from within a #tf.function", I mean #tf.function cannot go out of "traced" mode. Of course, using tf.numpy_function or tf.py_function you can have a traced function that uses eager/Python mode, which will be encapsulated in an operation as part of the traced graph.


How to use TFF APIs for custom usage?

I have read and studied the TFF guide and API pages carefully, but I am confused about some of the details.
For example, when I want to wrap/decorate a TF/Python function, I can use these two APIs:
1. tff.tf_computation()
2. tff.federated_computation()
I cannot find what the differences between them are and when I am allowed to use each, especially in case I want to use algorithms other than FedAvg or FedSGD. I wonder if you know:
How can they be used to manipulate inputs? Do they work on @CLIENTS or @SERVER?
How can I use them for something other than the output of tff.federated_mean or tff.federated_sum, where the value ends up on the server?
How can I access the details of the data and metrics on @CLIENTS and @SERVER?
Why should we invoke tff.tf_computation() from tff.federated_computation()? In this link there was no explanation about this.
Do these APIs (e.g. tff.federated_mean or tff.federated_sum) modify the output elements of each client and bring them to the @SERVER?
Could anyone help me understand the intuition behind these concepts?
A possible rule of thumb about the different function decorators:
tff.tf_computation is for wrapping TF logic. Think "tensors in, tensors out": this should be very similar to the usage of tf.function, where the parameters and return values are tensors, or nested structures of tensors. TFF intrinsics (e.g. tff.federated_mean) cannot be used inside a tff.tf_computation, and tff.tf_computations cannot call tff.federated_computations. The type signature is always unplaced.
tff.federated_computation should be used to wrap TFF programming abstractions. Think "tensors here, tensors there": inside this context, a tff.tf_computation can be applied to tff.Values, and tff.Values can be communicated to other placements using the intrinsics. The type signature can accept federated types (i.e. types with placements).
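A rough sketch of the distinction, modelled on the custom federated algorithms tutorials (add_half and add_half_on_clients are illustrative names, and decorator spellings may differ between TFF versions):

```python
import tensorflow as tf
import tensorflow_federated as tff

# "Tensors in, tensors out": plain TF logic, no placements.
@tff.tf_computation(tf.float32)
def add_half(x):
    return x + 0.5

# "Tensors here, tensors there": federated types plus intrinsics.
@tff.federated_computation(tff.FederatedType(tf.float32, tff.CLIENTS))
def add_half_on_clients(values):
    return tff.federated_map(add_half, values)

# In simulation, a value placed at CLIENTS is just a Python list
# (one element per client), and invoking the computation returns one.
print(add_half_on_clients([1.0, 2.0, 3.0]))  # [1.5, 2.5, 3.5]
```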
For your list of questions:
Both can work on values placed at CLIENTS or SERVER. For example, a tff.tf_computation called my_comp can be applied to a value v with type int32@CLIENTS using tff.federated_map(my_comp, v), which will run my_comp on each client.
tff.federated_map() supports applying a computation pointwise (across clients) to data that is not on the server. You can manipulate the metrics on each client using tff.federated_map. TFF isn't intended for running separate operations on different clients; the abstractions do not support addressing individual clients. You may be able to simulate this in Python; see Operations performed on the communications between the server and clients.
The values of placed data can be inspected in simulation simply by returning them from a tff.Computation, and invoking that computation. The values should be available in the Python environment.
tff.tf_computations should be invocable from anywhere; if there is documentation that says otherwise, please point to it. I believe what was intended to be highlighted is that a tff.federated_computation may invoke a tff.tf_computation, but not vice versa.
The tutorials (Federated Learning for Image Classification and Federated Learning for Text Generation) show examples of printing out the metrics in simulation. You may also be interested in the answer to how to print local outputs in tensorflow federated?
tff.tf_computations can be executed directly, if desired. This will avoid the federated part of TFF entirely and simply delegate to TensorFlow. To apply the computation to federated values and use it in combination with federated intrinsics, it must be called inside a tff.federated_computation.
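Continuing the hypothetical sketch above, direct invocation versus federated application would look roughly like this:

```python
# Direct invocation: no placements involved, delegates straight to TensorFlow.
print(add_half(1.0))                    # 1.5
# Applying the same logic to a federated value goes through an intrinsic,
# which is only possible inside a tff.federated_computation.
print(add_half_on_clients([1.0, 2.0]))  # [1.5, 2.5]
```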

TensorFlow: Is it possible to map a function to a dataset using a for-loop?

I have a tf.data.TFRecordDataset and a (computationally expensive) function which I want to map onto it. I use TensorFlow 1.12 with eager execution, and the function works on NumPy ndarray interpretations of the tensors in my dataset via EagerTensor.numpy(). However, code inside functions passed to tf.data.Dataset.map() is not executed eagerly, which is why the .numpy() conversion doesn't work there and .map() is no longer an option. Is it possible to for-loop through a dataset and modify the examples in it? Simply assigning to them doesn't seem to work.
No, not exactly.
A Dataset is inherently lazily evaluated and cannot be assigned to in that way. Conceptually, try to think of it as a pipeline rather than a variable: each value is read, passed through any map() operations, batch() ops, etc., and surfaced to the model as needed. To "assign" a value would be to write it to disk in the .tfrecord file, which just isn't likely to ever be supported (these files are specifically designed for fast sequential reads, not random access).
You could, instead, use TensorFlow to do your pre-processing and use a TFRecordWriter to write a NEW tfrecord with the expensive pre-processing already applied, then use this new dataset as the input to your model. If you have the disk space available, this might well be your best option.
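A rough sketch of that approach, assuming TF 1.x with eager execution enabled (file names are placeholders, and expensive_preprocess stands in for the costly NumPy work):

```python
import tensorflow as tf

tf.enable_eager_execution()  # TF 1.x only; eager is the default in TF 2.x

def expensive_preprocess(example_bytes):
    # Hypothetical stand-in for the expensive work: parse the serialized
    # example, transform its features with NumPy, and re-serialize it.
    example = tf.train.Example()
    example.ParseFromString(example_bytes)
    # ... modify example.features here ...
    return example.SerializeToString()

dataset = tf.data.TFRecordDataset("input.tfrecord")  # placeholder path

with tf.python_io.TFRecordWriter("preprocessed.tfrecord") as writer:
    for record in dataset:  # eager iteration yields EagerTensors
        writer.write(expensive_preprocess(record.numpy()))

# Later, train from the already-preprocessed file:
new_dataset = tf.data.TFRecordDataset("preprocessed.tfrecord")
```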

Inspecting the values of constant or variable tensors during debug

I implemented a model in TensorFlow (Python) that I previously programmed in C++ using Eigen, where it worked as expected. But the model is not working as expected in Python, and it's probably because I am defining tensors incorrectly or I am mixing up dimensions.
I am trying to get a feel for the problems by using Visual Studio's (2017) debugger (if a different IDE is better for this then I'm all ears, but I would prefer to stick with VS), but tensors do not evaluate to anything. I can understand this, because a tensor defines an operation rather than a data object (it only produces a data object after calling session.run).
However, constant and variable tensors - and any other tensors built solely on top of such tensors - come with predefined data. So hey, why not be able to inspect the value through the debugging UI?
So my question: is there a way to inspect the data with some extension?
For example, if I was working in C++ and with Eigen, I can use Eigen.natvis as described here. Anything similar for TensorFlow? It's not just a matter of seeing the evaluated value, either. It would be nice to see things like the shape, etc... while debugging.
I would also be open to other debugging techniques of TensorFlow code, if anyone has a good suggestion.
TensorFlow includes tfdbg, a debugger for TensorFlow models, where you can step through each execution step, check values, stop on NaN, etc. See the programmer's guide TensorFlow Debugger and The Debugger Dashboard for more information.
tfdbg can be a bit cumbersome to set up and use, though. A quick alternative for checking intermediate values is to use tf.Print operations. TensorFlow includes a few other debugging operations that you may find useful for checking some basic things.
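A minimal TF 1.x sketch of tf.Print (it passes its first argument through unchanged and prints the listed tensors whenever the op is evaluated):

```python
import tensorflow as tf  # TF 1.x graph mode

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)
# Print the shape and values of b to stderr each time b is evaluated.
b = tf.Print(b, [tf.shape(b), b], message="b shape and values: ")

with tf.Session() as sess:
    sess.run(b)  # the message appears during this run
```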
EDIT: Another tool that can be useful is eager execution. This allows you to use TensorFlow operations as if they were regular Python operations (they return the result of the operation instead of a graph object), so it is a good way to check whether some particular piece of code does what you expect.
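For example, a quick check under eager execution (TF 1.x spelling; eager is already the default in TF 2.x):

```python
import tensorflow as tf

tf.enable_eager_execution()  # TF 1.x; not needed in TF 2.x

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)
print(b)                 # values are available immediately, no session needed
print(b.shape, b.dtype)  # shape and dtype can also be inspected in a debugger
```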

When is it safe to cache tf.Tensors?

Let's say we have some method foo that we call during graph construction time, which returns some tf.Tensors or a nested structure of them every time it is called, and multiple other methods that make use of foo's result. For efficiency, and to avoid spamming the TF graph with unnecessary repeated operations, it might be tempting to make foo cache its result (to reuse the subgraph it produces) the first time it is called. However, that will fail if foo is ever used in the context of a control flow construct, like tf.cond, tf.map_fn or tf.while_loop.
My questions are:
When is it safe to cache tf.Tensor objects in such a way that it does not cause problems with control flow? Perhaps there is some way to retrieve the control flow context under which a tf.Tensor was created (if any), store it, and compare it later to see whether a cached result can be reused?
How would the answer to the question above apply to tf.Operations?
(Question text updated to make clearer that foo creates a new set of tensors every time it is called.)
TL;DR: TF already caches what it needs to, don't bother with it yourself.
Every time you call sess.run([some_tensors]), TF's engine finds the minimum subgraph needed to compute all tensors in [some_tensors] and runs it from top to bottom (possibly on new data, if you're not feeding it the same data).
That means caching results in between sess.run calls is useless for saving computation, because they will be recomputed anyway.
If, instead, you're concerned about having multiple tensors use the same data as input within one sess.run call, don't worry, TF is smart enough: if you have input A and B = 2*A, C = A + 1, then as long as you make a single call sess.run([B, C]), A will be evaluated only once (and then implicitly cached by the TF engine).
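A small sketch of that point (TF 1.x):

```python
import tensorflow as tf

A = tf.placeholder(tf.float32, shape=())
B = 2 * A
C = A + 1

with tf.Session() as sess:
    # A is fed and evaluated once; B and C both reuse that value
    # within this single sess.run call.
    b_val, c_val = sess.run([B, C], feed_dict={A: 3.0})
    print(b_val, c_val)  # 6.0 4.0
```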

Saving tensorflow-slim model using steps intervals instead of time interval (seconds)

I am using the object detection API in TensorFlow. In my previous work I used to check the current step and save my model every n steps, something like the approach mentioned here.
In this case, though, the authors use TensorFlow-Slim to perform the training. So they use a tf.train.Saver which is passed to the function that actually performs the training: slim.learning.train(). While this function has a parameter controlling the interval at which the trained model is written out, save_interval_secs, it is time-dependent and not step-dependent.
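For illustration, the setup being described looks roughly like this (a hedged sketch with a toy train_op and placeholder paths; in the object detection code the train_op comes from the existing pipeline):

```python
import tensorflow as tf

slim = tf.contrib.slim  # TF 1.x

# Toy train_op just to make the sketch self-contained.
w = tf.get_variable("w", initializer=1.0)
loss = tf.square(w - 3.0)
train_op = slim.learning.create_train_op(
    loss, tf.train.GradientDescentOptimizer(0.1))

saver = tf.train.Saver(max_to_keep=5)

# Checkpointing here is time-based: one checkpoint every save_interval_secs
# seconds, with no built-in "every n steps" option.
slim.learning.train(train_op,
                    logdir="/tmp/train_logs",  # placeholder path
                    saver=saver,
                    save_interval_secs=600,
                    number_of_steps=1000)
```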
So, since tf.train.Saver is a "passive" utility, as mentioned here, which just saves a model with the provided parameters and is ignorant of any notion of time or steps, and since in the object detection code control is handed over to TensorFlow-Slim by passing the saver as a parameter, how can I save my model step-wise (every n steps instead of every x seconds)?
Is the only solution to dig into the slim code and edit it (with all the risks that entails)? Or is there another option I am not familiar with?
P.S.1
I found out there is an astonishingly similar question about this here, but unfortunately it did not receive any answers. So, since my problem persists, I will leave this question intact to raise some interest in the issue.
P.S.2
Looking into the slim.learning code, I found that train(), after processing its parameters, just hands control over to supervisor.Supervisor, which refers to tf.train.Supervisor. This is a bit odd, since that class is considered deprecated. The use of Supervisor is also mentioned in the docstrings of slim.learning.
