I have read the TFF guide and API pages carefully, but I am confused about some of the details.
For example, when I want to wrap/decorate a TF/Python function, I can use these two APIs:
1. tff.tf_computation()
2. tff.federated_computation()
I cannot find what the differences between them are, or when I am allowed to use each of them, especially if I want to use algorithms other than FedAvg or FedSGD. I wonder if you know:
How can they be used to manipulate inputs? Do they work on #CLIENT or #SERVER?
How can I use them beyond the outputs of tff.federated_mean or tff.federated_sum, where the value ends up on the server?
How can I access the details of the data and metrics on #CLIENT and #SERVER?
Why should we invoke tff.tf_computation() from tff.federated_computation()? The linked guide does not explain this.
Do these APIs (e.g. tff.federated_mean or tff.federated_sum) modify the output elements of each #CLIENT and bring them to the #SERVER?
Could anyone help me understand the intuition behind these concepts?
A possible rule of thumb about the different function decorators:
tff.tf_computation is for wrapping TF logic. Think "tensors in, tensors out": this should be very similar to the usage of tf.function, where the parameters and return values are tensors, or nested structures of tensors. TFF intrinsics (e.g. tff.federated_mean) cannot be used inside a tff.tf_computation, and tff.tf_computations cannot call tff.federated_computations. The type signature is always unplaced.
tff.federated_computation should be used to wrap TFF programming abstractions. Think "tensors here, tensors there": Inside this context, a tff.tf_computation can be applied to tff.Values and tff.Values can be communicated to other placements using the intrinsics. The type signature can accept federated types (i.e. types with placements).
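A minimal sketch of the two decorators, loosely following the TFF custom federated algorithms tutorial (the exact decorator signatures can vary between TFF versions):

import tensorflow as tf
import tensorflow_federated as tff

# "Tensors in, tensors out": plain TF logic with an unplaced type signature.
@tff.tf_computation(tf.float32)
def add_half(x):
    return tf.add(x, 0.5)

# Operates on placed values; intrinsics like tff.federated_map are allowed here.
@tff.federated_computation(tff.FederatedType(tf.float32, tff.CLIENTS))
def add_half_on_clients(values):
    return tff.federated_map(add_half, values)

print(add_half.type_signature)             # (float32 -> float32), unplaced
print(add_half_on_clients.type_signature)  # ({float32}@CLIENTS -> {float32}@CLIENTS)

In simulation, add_half(2.0) runs the TF logic directly, while add_half_on_clients([1.0, 2.0, 3.0]) accepts a plain Python list as the client-placed value and returns the per-client results as a list, which is also how placed values can be inspected.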
For your list of questions:
Both can work on values placed at CLIENTS or SERVER. For example, a tff.tf_computation called my_comp can be applied to a value v of type int32 placed at CLIENTS ({int32}@CLIENTS in TFF's type notation) with tff.federated_map(my_comp, v), which will run my_comp on each client.
tff.federated_map() supports applying a computation pointwise (across clients) to data that is not on the server. You can manipulate the metrics on each client using tff.federated_map. TFF isn't intended for running different operations on different clients; the abstractions do not support addressing individual clients. You may be able to simulate this in Python; see Operations performed on the communications between the server and clients.
The values of placed data can be inspected in simulation simply by returning them from a tff.Computation, and invoking that computation. The values should be available in the Python environment.
tff.tf_computations should be invokable from anywhere, if there is documentation that says otherwise please point to it. I believe what was intended to highlight is that a tff.federated_computation may invoke a tff.tf_computation, but not vice versa.
The tutorials (Federated Learning for Image Classification and Federated Learning for Text Generation) show examples of printing out the metrics in simulation. You may also be interested in the answer to how to print local outputs in tensorflow federated?
tff.tf_computations can be executed directly if desired. This avoids the federated part of TFF entirely and simply delegates to TensorFlow. To apply the computation to federated values and use it in combination with federated intrinsics, it must be called inside a tff.federated_computation.
I have an OOP solution written in Python which is mostly focused on managing different kinds of hardware components such as cameras, servos, proximity sensors, etc.
What I have is a bunch of operation managers. An operation manager is basically a class with more than one public method defined inside of it. The rules I have defined are as follows:
1. Different operation managers can call each other’s public methods
2. Multiple operation managers are involved in one specific use-case
3. An operation manager's method execution depends on the result of the previous operation manager (if the previous one executed successfully, execute this one; otherwise terminate)
4. Each operation manager must be able to report its failure to a common channel (logging)
5. There’s no need for a transactional behavior (rollback)
What I am aiming for here is to be able to:
1. Easily integrate a new operation manager
2. Test a specific use-case (a set of operation manager operations)
3. Bring in a level of abstraction and have the different operation managers decoupled from each other.
I have been looking at the Chain of Responsibility (CoR) pattern but am still not sure whether it is the best option for me.
Nope. Chain of responsibility is useful for step-by-step processing of something, where each component may or may not be involved, or may or may not terminate the entire execution. It describes a linear ordering of "steps" and is typically implemented as a linked list of "links" - particular objects responsible for processing particular data. HTTP interceptors are classic examples. For non-linear ordering a graph is used, and that has little to do with GoF's chain of responsibility: "little" because a linked list is a kind of graph by its nature.
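For reference, a minimal sketch of that linked-list shape (can_handle() and process() are hypothetical hooks, not from any library):

class Handler:
    # Each link either handles the request itself or passes it on to the
    # next link; a request may fall off the end unhandled.
    def __init__(self, successor=None):
        self.successor = successor

    def handle(self, request):
        if self.can_handle(request):
            return self.process(request)
        if self.successor is not None:
            return self.successor.handle(request)
        return None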
What you've described is too broad to point at a specific pattern. It can be solved with a few patterns in place, depending upon code complexity, outer dependencies, number of use cases and many other factors.
Since you are centered around a use-case primitive, why don't you define it rigorously in your code? A UseCase accepts whatever it needs and spits out a result of a certain unified form - you'll have to introduce a common result/failure-reporting object, general enough to be reused by all use cases.
What I've described is not a pattern, at least not a GoF pattern, though it is definitely a good starting point to specialize your requirements and expectations.
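For illustration only, one hypothetical shape the UseCase idea could take (all class and method names are made up):

from dataclasses import dataclass, field

@dataclass
class Result:
    # Unified result/failure-reporting object shared by all use cases.
    ok: bool
    errors: list = field(default_factory=list)

class MoveAndCaptureUseCase:
    # One use case orchestrating several operation managers; each step runs
    # only if the previous one succeeded, and failures are logged centrally.
    def __init__(self, servo_manager, camera_manager, logger):
        self.servo_manager = servo_manager
        self.camera_manager = camera_manager
        self.logger = logger

    def execute(self, angle) -> Result:
        if not self.servo_manager.rotate(angle):
            self.logger.error("servo rotation failed")
            return Result(ok=False, errors=["servo rotation failed"])
        if not self.camera_manager.capture():
            self.logger.error("camera capture failed")
            return Result(ok=False, errors=["camera capture failed"])
        return Result(ok=True)

Adding a new operation manager then only touches the use case that orchestrates it, and each use case can be tested in isolation with fake managers.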
I implemented a model in TensorFlow (Python) that I previously programmed in C++ using Eigen, where it worked as expected. But the model is not working as expected in Python, and it's probably because I am defining tensors incorrectly or I am mixing up dimensions.
I am trying to get a feel for the problems by using Visual Studio's (2017) debugger (if a different IDE is better for this then I'm all ears, but I would prefer to stick with VS), but tensors do not evaluate to anything - and I can understand this because the tensor defines an operation and not a data object (well it only produces a data object after calling a session.run).
However, constant and variable tensors - and any other tensors built solely on top of such tensors - come with predefined data. So hey, why not be able to inspect the value through the debugging UI?
So my question: is there a way to inspect the data with some extension?
For example, if I was working in C++ and with Eigen, I can use Eigen.natvis as described here. Anything similar for TensorFlow? It's not just a matter of seeing the evaluated value, either. It would be nice to see things like the shape, etc... while debugging.
I would also be open to other debugging techniques of TensorFlow code, if anyone has a good suggestion.
TensorFlow includes tfdbg, a debugger for TensorFlow models, where you can step through each execution step, check values, stop on NaN, etc. See the programmer's guide TensorFlow Debugger and The Debugger Dashboard for more information.
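For TF 1.x graph code, the usual way to hook tfdbg in is to wrap the session; a minimal sketch:

import tensorflow as tf
from tensorflow.python import debug as tf_debug

sess = tf.Session()
# Every sess.run() now drops into the tfdbg CLI, where tensor values can be
# inspected and the graph stepped through.
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
# Optional filter to stop as soon as any tensor contains a NaN or Inf.
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)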
tfdbg can be a bit cumbersome to set up and use, though. A quick alternative to check intermediate values is to use tf.Print operations. TensorFlow includes a few other debugging operations that you may find useful to check for some basic things.
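For example (the tensors here are made up), tf.Print is an identity op that logs the listed tensors to stderr whenever it runs inside session.run():

import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)
# tf.Print returns y unchanged but logs the shape and value when evaluated.
y = tf.Print(y, [tf.shape(y), y], message='y shape and value: ')

with tf.Session() as sess:
    print(sess.run(y))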
EDIT: Another tool that can be useful is eager execution. This allows you to use TensorFlow operations as if they were regular Python operations (they return the result of the operation instead of the graph object), so it is a good way to check if some particular code does what you expect.
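A minimal sketch (in older 1.x releases the call lives under tf.contrib.eager instead of the top-level namespace):

import tensorflow as tf
tf.enable_eager_execution()  # must run before any graph ops are created

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)
print(y)          # values are available immediately, no Session needed
print(y.shape)    # shape and dtype can be inspected like normal attributes
print(y.numpy())  # convert to a NumPy array, e.g. for the VS debugger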
Dask supports defining custom computational graphs as well as opportunistic caching. The question is how they can be used together.
For instance, let's define a very simple computational graph that computes the x+1 operation,
import dask
def compute(x):
    graph = {'step1': (sum, [x, 1])}
    return dask.get(graph, 'step1')
print('Cache disabled:', compute(1), compute(2))
this yields 2 and 3 as expected.
Now we enable opportunistic caching,
from dask.cache import Cache
cc = Cache(1e9)
cc.register()
print('Cache enabled: ', compute(1), compute(2))
print(cc.cache.data)
we incorrectly get a result of 2 in both cases, because cc.cache.data is {'step1': 2} irrespective of the input.
I imagine this means that the input needs to be hashed (e.g. with dask.base.tokenize) and appended to all the keys in the graph. Is there a simpler way of doing it, particularly since the tokenize function is not part of the public API?
The issue is that in complex graphs, a given step name needs to account for the hash of all the inputs provided to its child steps, which means that it is necessary to do full graph resolution.
It's important that key names in dask graphs are unique (as you found above). Additionally, we'd like identical computations to have the same key so we can avoid computing them multiple times - this isn't necessary for dask to work though, it just provides some opportunities for optimization.
In dask's internals we make use of dask.base.tokenize to compute a "hash" of the inputs, resulting in deterministic key names. You are free to make use of this function as well. In the issue you linked above we say the function is public, just that the implementation might change (not the signature).
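For example, one way to adapt the compute() function above so the cache keys depend on the input (a sketch):

import dask
from dask.base import tokenize

def compute(x):
    # Hash the input into the key: different inputs get different keys,
    # identical inputs share a cache entry.
    key = 'step1-' + tokenize(x)
    graph = {key: (sum, [x, 1])}
    return dask.get(graph, key)

print(compute(1), compute(2))  # 2 and 3, even with the cache registered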
Also note that for many use cases, we recommend using dask.delayed now instead of custom graphs for generating custom computations. This will do the deterministic hashing for you behind the scenes.
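A rough equivalent of the example above with dask.delayed; pure=True explicitly asks for input-hashed, deterministic keys (whether that is the default depends on your dask version):

import dask
from dask.cache import Cache

Cache(1e9).register()

def add_one(x):
    return x + 1

# delayed builds the graph for us; pure=True hashes the arguments into the key.
lazy_add_one = dask.delayed(add_one, pure=True)

print(lazy_add_one(1).compute(), lazy_add_one(2).compute())  # 2 and 3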
With the recent upgrade to version 1.4, TensorFlow included tf.data in the library core.
One "major new feature" described in the version 1.4 release notes is tf.data.Dataset.apply(), which is a "method for applying custom transformation functions". How is this different from the already existing tf.data.Dataset.map()?
The difference is that map will execute one function on every element of the Dataset separately, whereas apply will execute one function on the whole Dataset at once (such as group_by_window, given as an example in the documentation).
The argument of apply is a function that takes a Dataset and returns a Dataset, whereas the argument of map is a function that takes one element and returns one transformed element.
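A small illustration of the two signatures (keep_even is a made-up whole-Dataset transformation):

import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# map: the function receives one element at a time.
squares = dataset.map(lambda x: x * x)

# apply: the function receives the whole Dataset and must return a Dataset.
def keep_even(ds):
    return ds.filter(lambda x: tf.equal(x % 2, 0))

evens = dataset.apply(keep_even)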
Sunreef's answer is absolutely correct. You might still be wondering why we introduced Dataset.apply(), and I thought I'd offer some background.
The tf.data API has a set of core transformations—like Dataset.map() and Dataset.filter()—that are generally useful across a wide range of datasets, unlikely to change, and implemented as methods on the tf.data.Dataset object. In particular, they are subject to the same backwards compatibility guarantees as other core APIs in TensorFlow.
However, the core approach is a bit restrictive. We also want the freedom to experiment with new transformations before adding them to the core, and to allow other library developers to create their own reusable transformations. Therefore, in TensorFlow 1.4 we split out a set of custom transformations that live in tf.contrib.data. The custom transformations include some that have very specific functionality (like tf.contrib.data.sloppy_interleave()), and some where the API is still in flux (like tf.contrib.data.group_by_window()). Originally we implemented these custom transformations as functions from Dataset to Dataset, which had an unfortunate effect on the syntactic flow of a pipeline. For example:
dataset = tf.data.TFRecordDataset(...).map(...)
# Method chaining breaks when we apply a custom transformation.
dataset = custom_transformation(dataset, x, y, z)
dataset = dataset.shuffle(...).repeat(...).batch(...)
Since this seemed to be a common pattern, we added Dataset.apply() as a way to chain core and custom transformations in a single pipeline:
dataset = (tf.data.TFRecordDataset(...)
           .map(...)
           .apply(custom_transformation(x, y, z))
           .shuffle(...)
           .repeat(...)
           .batch(...))
It's a minor feature in the grand scheme of things, but hopefully it helps to make tf.data programs easier to read, and the library easier to extend.
I don't have enough reputation to comment, but I just wanted to point out that you can actually use map to apply a function to multiple elements of a dataset, contrary to Sunreef's comments on his own post.
According to the documentation, map takes as an argument
map_func: A function mapping a nested structure of tensors (having shapes and types defined by self.output_shapes and self.output_types) to another nested structure of tensors.
The output_shapes are defined by the dataset and can be modified by using API functions like batch. So, for example, you can do a batch normalization using only dataset.batch and .map with:
dataset = dataset ...
dataset = dataset.batch(batch_size)
dataset = dataset.map(normalize_fn)
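where normalize_fn could be, for instance, a hypothetical per-batch normalization along these lines:

import tensorflow as tf

def normalize_fn(batch):
    # After dataset.batch(), map() sees whole batches, so batch statistics
    # are available inside the mapped function.
    mean, variance = tf.nn.moments(batch, axes=[0])
    return (batch - mean) / tf.sqrt(variance + 1e-8)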
It seems like the primary utility of apply() is when you really want to do a transformation across the entire dataset.
Simply put, the argument of apply()'s transformation_func is a Dataset; the argument of map()'s map_func is an element.
I have a scientific data management problem which seems general, but I can't find an existing solution or even a description of it, which I have long puzzled over. I am about to embark on a major rewrite (python) but I thought I'd cast about one last time for existing solutions, so I can scrap my own and get back to the biology, or at least learn some appropriate language for better googling.
The problem:
I have expensive (hours to days to calculate) and big (GBs) data attributes that are typically built as transformations of one or more other data attributes. I need to keep track of exactly how this data is built so I can reuse it as input for another transformation if it fits the problem (built with the right specification values) or construct new data as needed. Although it shouldn't matter, I typically start with 'value-added', somewhat heterogeneous molecular biology info, for example, genomes with genes and proteins annotated by other processes by other researchers. I need to combine and compare these data to make my own inferences. A number of intermediate steps are often required, and these can be expensive. In addition, the end results can become the input for additional transformations. All of these transformations can be done in multiple ways: restricting with different initial data (e.g. using different organisms), by using different parameter values in the same inferences, or by using different inference models, etc. The analyses change frequently and build on others in unplanned ways. I need to know what data I have (what parameters or specifications fully define it), both so I can reuse it if appropriate and for general scientific integrity.
My efforts in general:
I design my Python classes with the problem of description in mind. All data attributes built by a class object are described by a single set of parameter values. I call these defining parameters or specifications the 'def_specs', and these def_specs with their values the 'shape' of the data atts. The entire global parameter state for the process might be quite large (e.g. a hundred parameters), but the data atts provided by any one class require only a small number of these, at least directly. The goal is to check whether previously built data atts are appropriate by testing whether their shape is a subset of the global parameter state.
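In code, the reuse test itself can be as simple as this hypothetical helper (names are made up):

def shape_is_reusable(shape, global_params):
    # A previously built data att is reusable when every def_spec in its
    # shape matches the current global parameter state, i.e. the shape is a
    # subset of that state.
    return all(global_params.get(spec) == value for spec, value in shape.items())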
Within a class it is easy to find the needed def_specs that define the shape by examining the code. The rub arises when a module needs a data att from another module. These data atts will have their own shape, perhaps passed as args by the calling object, but more often filtered from the global parameter state. The calling class should be augmented with the shape of its dependencies in order to maintain a complete description of its data atts.
In theory this could be done manually by examining the dependency graph, but this graph can get deep, and there are many modules, which I am constantly changing and adding, and ... I'm too lazy and careless to do it by hand.
So, the program dynamically discovers the complete shape of the data atts by tracking calls to other classes' attributes and pushing their shape back up to the caller(s) through a managed stack of __get__ calls. As I rewrite, I find that I need to strictly control attribute access to my builder classes to prevent arbitrary info from influencing the data atts. Fortunately, Python makes this easy with descriptors.
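A highly simplified, hypothetical sketch of that idea (TrackedDataAtt, _build() and _shape are invented names, not from any library; __set_name__ needs Python 3.6+):

class TrackedDataAtt:
    # Descriptor controlling access to a builder class's data attributes:
    # __get__ builds (or fetches) the value and merges the attribute's shape
    # into the owner's shape, so dependency shapes propagate to whoever
    # ultimately reads the attribute.
    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        value, dep_shape = obj._build(self.name)
        obj._shape.update(dep_shape)
        return value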
I store the shape of the data atts in a db so that I can query whether appropriate data (i.e. data whose shape is a subset of the current parameter state) already exists. In my rewrite I am moving from MySQL via the great SQLAlchemy to an object db (ZODB or CouchDB?), because the table for each class has to be altered when additional def_specs are discovered, which is a pain, and because some of the def_specs are Python lists or dicts, which are a pain to translate to SQL.
I don't think this data management can be separated from my data transformation code because of the need for strict attribute control, though I am trying to do so as much as possible. I can use existing classes by wrapping them with a class that provides their def_specs as class attributes, and db management via descriptors, but these classes are terminal in that no further discovery of additional dependency shape can take place.
If the data management cannot easily be separated from the data construction, I guess it is unlikely that there is an out-of-the-box solution, only a thousand specific ones. Perhaps there is an applicable pattern? I'd appreciate any hints on how to go about looking, or how to better describe the problem. To me it seems like a general issue, though managing deeply layered data is perhaps at odds with the prevailing winds of the web.
I don't have specific python-related suggestions for you, but here are a few thoughts:
You're encountering a common challenge in bioinformatics. The data is large, heterogeneous, and comes in constantly changing formats as new technologies are introduced. My advice is to not overthink your pipelines, as they're likely to be changing tomorrow. Choose a few well defined file formats, and massage incoming data into those formats as often as possible. In my experience, it's also usually best to have loosely coupled tools that do one thing well, so that you can chain them together for different analyses quickly.
You might also consider taking a version of this question over to the bioinformatics stack exchange at http://biostar.stackexchange.com/
ZODB was not designed to handle massive data; it is mainly for web-based applications, and in any case it is a flat-file-based database.
I recommend you try PyTables, a Python library for handling HDF5 files, which is a format used in astronomy and physics to store results from big calculations and simulations. It can be used as a hierarchical database and also has an efficient way to pickle Python objects. By the way, the author of PyTables explained that ZODB was too slow for what he needed to do, and I can confirm that. If you are interested in HDF5, there is also another library, h5py.
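For instance, a minimal h5py sketch (the file, dataset, and attribute names are made up) that stores a result together with the parameters that produced it:

import h5py
import numpy as np

with h5py.File('results.h5', 'w') as f:
    dset = f.create_dataset('alignment_scores', data=np.random.rand(1000))
    # Store the defining parameters alongside the data as HDF5 attributes.
    dset.attrs['organism'] = 'E. coli'
    dset.attrs['model'] = 'hmm'
    dset.attrs['cutoff'] = 0.05

# Later, read the attributes back to decide whether the data is reusable.
with h5py.File('results.h5', 'r') as f:
    print(dict(f['alignment_scores'].attrs))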
As a tool for managing the versioning of the different calculations you have, you can try Sumatra, which is something like an extension to git/trac but designed for simulations.
You should ask this question on Biostar; you will find better answers there.