I am trying to train a neural network in TensorFlow that takes a RaggedTensor as input (tf.keras.layers.Input). It works fine on CPU and GPU, but I am really struggling to make it work on TPU. I wanted to know if some of you have managed to make it work (not necessarily looking for a direct solution, though that would be great; some pointers would already help!). So far the error messages were explicit enough for me to make progress, but I am now unsure how to go further.
What I did so far:
I am using tf.data.Dataset to read data from TFRecords, but I needed to explicitly transform it into a DistributedDataset to disable prefetching:
strategy.experimental_distribute_dataset(
    dataset,
    tf.distribute.InputOptions(
        experimental_prefetch_to_device=False
    )
)
I got Compilation failure: Detected unsupported operations when trying to compile graph ... on XLA_TPU_JIT: RaggedTensorToTensor, which could be (sort of) fixed by allowing soft device placement:
tf.config.set_soft_device_placement(True)
I now get stuck with Compilation failure: Input 1 to node '.../RaggedReduceSum/RaggedReduce/RaggedSplitsToSegmentIds/Repeat/SequenceMask/Range' with op Range must be a compile-time constant. I totally understand why I get this error: I am fully aware of which ops are available on TPU, and in particular that most dynamic shapes must be known at compile time to run on TPU. But I can't see how I could use these ragged tensors with TPU then...
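For what it's worth, the only direction I can think of so far is padding the ragged dimension to a fixed, compile-time-known length before the TPU-bound part of the model. A minimal sketch of that idea (MAX_TOKENS is a hypothetical bound I would have to pick for my data):

import tensorflow as tf

MAX_TOKENS = 128  # hypothetical upper bound on the ragged dimension

ragged = tf.ragged.constant([[1, 2, 3], [4], [5, 6]])
dense = ragged.to_tensor(default_value=0, shape=[None, MAX_TOKENS])
mask = tf.sequence_mask(ragged.row_lengths(), maxlen=MAX_TOKENS)

# A ragged reduce_sum can then be expressed over the padded tensor and the mask,
# keeping every shape static for XLA:
sums = tf.reduce_sum(dense * tf.cast(mask, dense.dtype), axis=-1)
print(sums)  # [ 6  4 11]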
Any idea would be appreciated :)
P.S.: I haven't seen much news from the TensorFlow team on RaggedTensors on TPU since this answer back in July 2020, but I may have missed something. Pointers to the relevant GitHub threads would already be great so I can investigate further.
Related
I have a semantic-segmentation model created using Keras.
Now I want to use it in production where I need to execute the model on a large folder with 10k-100k images a few times a day. This takes several hours, so every improvement is helpful.
I'm wondering what the correct way to use it in production is. I currently just use model.predict() on a created Sequence (roughly sketched at the end of this question). But everywhere I look I see all kinds of different libraries or technologies that seem relevant:
TensorFlow Serving, converting to C, different libraries by Intel, and others.
I'm wondering what the bottom-line recommended way is to run a model in production, as robustly and as efficiently as possible.
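For reference, my current inference code is roughly the following sketch (paths, image size, batch size, and the model file are placeholders, not my actual values):

import os
import numpy as np
import tensorflow as tf

class FolderSequence(tf.keras.utils.Sequence):
    def __init__(self, folder, batch_size=16, target_size=(256, 256)):
        self.paths = sorted(
            os.path.join(folder, f) for f in os.listdir(folder)
            if f.lower().endswith((".png", ".jpg", ".jpeg"))
        )
        self.batch_size = batch_size
        self.target_size = target_size

    def __len__(self):
        return int(np.ceil(len(self.paths) / self.batch_size))

    def __getitem__(self, idx):
        batch_paths = self.paths[idx * self.batch_size:(idx + 1) * self.batch_size]
        images = [
            tf.keras.preprocessing.image.img_to_array(
                tf.keras.preprocessing.image.load_img(p, target_size=self.target_size)
            ) / 255.0
            for p in batch_paths
        ]
        return np.stack(images)

model = tf.keras.models.load_model("segmentation_model.h5")  # placeholder model file
masks = model.predict(FolderSequence("/data/images"), verbose=1)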
I'm not sure this has a canonical answer — as with many things there are lots of tradeoffs between different choices — but I'll attempt to give an answer.
I've been pretty happy using TensorFlow Serving to do model deployment, with separate services doing the business logic calling those models and doing something with the predictions. That provides a small boost because there won't be as much resource contention — the TensorFlow Serving instances do nothing but run the models. We have them deployed via Kubernetes, and that makes grouping a cluster of TensorFlow Serving instances very easy if you want to scale horizontally as well for more throughput.
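For concreteness, here is roughly what the export and client side of that setup look like; the model name, file paths, and input shape below are placeholders rather than anything from the question:

import numpy as np
import requests
import tensorflow as tf

# Export the trained Keras model in the directory layout TensorFlow Serving expects:
#   <base_path>/<model_name>/<version>/
model = tf.keras.models.load_model("segmentation_model.h5")  # placeholder model file
model.save("/models/segmenter/1")  # SavedModel format, version 1

# After starting TF Serving (e.g. the tensorflow/serving Docker image pointed at
# /models/segmenter), predictions go through its REST API:
batch = np.zeros((1, 256, 256, 3), dtype=np.float32)  # placeholder input shape
resp = requests.post(
    "http://localhost:8501/v1/models/segmenter:predict",
    json={"instances": batch.tolist()},
)
predictions = resp.json()["predictions"]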
You're unlikely to get meaningful improvements by messing around at the edges with things like making sure the TensorFlow Serving deployment is compiled with the right flags to use all of Intel's vector instructions. The big boost comes from running everything in fast C++ code. The one (probably very obvious) way to boost performance further is to run the inference on a GPU rather than a CPU. That's going to scale more or less the way you'd expect: the more powerful the GPU, the faster the inference.
There are probably more involved things you could do to eke out more single-percentage-point gains, but this strikes a pretty good balance of speed with flexibility. It's definitely a bit more finicky to have this separate service architecture: if you're not doing anything too complicated, it might be easier (if quite a bit slower) to use your models "as-is" in a Python script rather than going to the trouble of setting up TensorFlow Serving. On the other hand, the speedup is pretty significant, and it's pretty easy to manage. On the other end of the spectrum, I have no idea what crazy things you could do to eke out more marginal performance gains, but instinct tells me that they're going to be pretty exotic, and therefore pretty difficult to maintain.
This is difficult to answer definitively, but I will consider the following orthogonal aspects:
Is it possible to run the model at a lower resolution? If so, resize the images before running the model; this should give you roughly an X**2 speed-up, where X is the downsampling factor you use (see the sketch after this list).
Production models are often executed remotely, so understanding your remote machine configuration is very important. If you only have CPU machines, options like OpenVINO typically give more speed-up than native TensorFlow. If you have GPU machines, options like TensorRT can also help. The actual speed-up is hard to estimate, but I would say at least 2x.
Upload/download JPEG images instead of PNG or BMP. This should significantly reduce your communication time.
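To illustrate the first point, a minimal sketch of inference at a reduced resolution (the factor, image size, and the commented-out model calls are placeholders):

import numpy as np
import tensorflow as tf

X = 2                        # hypothetical downsampling factor
orig_h, orig_w = 1024, 1024  # hypothetical original resolution
image = np.random.rand(orig_h, orig_w, 3).astype(np.float32)  # stand-in for a real image

small = tf.image.resize(image, (orig_h // X, orig_w // X))
# prediction = model.predict(tf.expand_dims(small, 0))  # inference at reduced resolution
# If the output is a segmentation mask, upsample it back to the original size:
# mask = tf.image.resize(prediction[0], (orig_h, orig_w))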
I implemented a model in TensorFlow (Python) that I previously programmed in C++ using Eigen, where it worked as expected. But the model is not working as expected in Python, and it's probably because I am defining tensors incorrectly or I am mixing up dimensions.
I am trying to get a feel for the problems by using Visual Studio 2017's debugger (if a different IDE is better for this, I'm all ears, but I would prefer to stick with VS), but tensors do not evaluate to anything - and I can understand this, because a tensor defines an operation rather than a data object (it only produces a data object after calling session.run).
However, constant and variable tensors - and any other tensors built solely on top of such tensors - come with predefined data. So hey, why not be able to inspect the value through the debugging UI?
So my question: is there a way to inspect the data with some extension?
For example, if I was working in C++ and with Eigen, I can use Eigen.natvis as described here. Anything similar for TensorFlow? It's not just a matter of seeing the evaluated value, either. It would be nice to see things like the shape, etc... while debugging.
I would also be open to other debugging techniques of TensorFlow code, if anyone has a good suggestion.
TensorFlow includes tfdbg, a debugger for TensorFlow models, where you can step through each execution step, check values, stop on NaN, etc. See the programmer's guide TensorFlow Debugger and The Debugger Dashboard for more information.
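For example, wrapping an existing Session with the tfdbg CLI looks roughly like this (a sketch only; your own graph-building code and run calls go where indicated):

import tensorflow as tf
from tensorflow.python import debug as tf_debug

sess = tf.Session()
# Wrap the session so every sess.run() call drops into the tfdbg CLI
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
# A stock filter that lets you run until a tensor contains NaN or Inf
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
# ... build your graph and call sess.run(...) as usual ...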
tfdbg can be a bit cumbersome to set up and use, though. A quick alternative to check intermediate values is to use tf.Print operations. TensorFlow includes a few other debugging operations that you may find useful for checking some basic things.
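For instance, a tf.Print node can be spliced into the graph to log a tensor's shape and values whenever it is evaluated (toy graph for illustration):

import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)
# tf.Print is an identity op: it passes y through unchanged but logs the listed
# tensors to stderr every time y is evaluated
y = tf.Print(y, [tf.shape(y), y], message="y shape and values: ", summarize=10)

with tf.Session() as sess:
    sess.run(y)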
EDIT: Another tool that can be useful is eager execution. This allows you to use TensorFlow operations as if they were regular Python operations (they return the result of the operation instead of the graph object), so it is a good way to check if some particular code does what you expect.
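A toy example of what that looks like (TF 1.x style, where eager execution has to be enabled at program startup):

import tensorflow as tf

tf.enable_eager_execution()  # must run before any graph is built

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)
print(y.shape)    # (2, 2)
print(y.numpy())  # concrete values, no Session required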
I am looking for higher layer abstractions for my deep learning project.
A few doubts lately:
I am really confused about which is more actively maintained: tflearn (docs) or tensorflow.contrib.learn. The projects are different, and both are actively contributed to on GitHub. I could not find out why people are working this way: same goal, same name, but developed separately.
As if that was not enough, we also have skflow. Why is this a separate project? It aims to mimic scikit-learn-like functionality for deep learning (just like tflearn does).
There are more and more coming; which one should I choose, and which one will be maintained in the future?
Any ideas?
PS: I know this might get closed. but I would definitely want some answers first. Those wanting it closed, please care to drop a reason/hint/link in comments
What about Keras (https://keras.io/)? It is easy to use, and you can do pretty much everything you want with it. It uses either Theano or TensorFlow as its backend. Kaggle contests are often solved using Keras (e.g. https://github.com/EdwardTyantov/ultrasound-nerve-segmentation).
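For example, a small model takes only a handful of lines (the layer sizes here are placeholders):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=100))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=10, batch_size=32)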
Edit:
Because you did not specify Python, I would also recommend MatConvNet if you are looking for more abstraction.
I'm trying to execute the logistic_sgd.py code on an Amazon cluster running the ami-b141a2f5 (Theano - CUDA 7) image.
Instead of the included MNIST database I am using the SD19 database, which requires changing a few dimensional constants, but otherwise no code has been touched. The code runs fine locally on my CPU, but once I copy the code and data to the Amazon cluster via SSH and run it there, I get this output:
It looks to me like it is running out of VRAM, but it was my understanding that the code should run on a GPU already, without any tinkering on my part necessary. After following the suggestion from the error message, the error persists.
There's nothing especially strange here. The error message is almost certainly accurate: there really isn't enough VRAM. Often, a script will run fine on CPU but then fail like this on GPU simply because there is usually much more system memory available than GPU memory, especially since the system memory is virtualized (and can page out to disk if required) while the GPU memory isn't.
For this script, there needs to be enough memory to store the training, validation, and testing data sets, the model parameters, and enough working space to store intermediate results of the computation. There are two options available:
Reduce the amount of memory needed for one or more of these three components. Reducing the amount of training data is usually easiest; reducing the size of the model comes next. Unfortunately, both of those options will often impair the quality of the result you are looking for. Reducing the amount of memory needed for intermediate results is usually beyond the developer's control -- it is managed by Theano -- but there is sometimes scope for altering the computation to achieve this goal once you have a good understanding of Theano's internals.
If the model parameters and working memory can fit in GPU memory then the most common solution is to change the code so that the data is no longer stored in GPU memory (i.e. just store it as numpy arrays, not as Theano shared variables) then pass each batch of data in as inputs instead of givens. The LSTM sample code is an example of this approach.
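As a rough illustration of that second option (not the actual logistic_sgd.py code, just a minimal self-contained sketch), the data lives in numpy arrays on the host and each minibatch is passed in as an explicit input:

import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX

# Toy data kept as plain numpy arrays in host memory (NOT Theano shared variables)
rng = np.random.RandomState(0)
train_x = rng.randn(1000, 20).astype(floatX)
train_y = (rng.rand(1000) > 0.5).astype('int32')

x = T.matrix('x')   # minibatch of inputs, supplied at call time
y = T.ivector('y')  # minibatch of labels, supplied at call time
W = theano.shared(np.zeros((20, 2), dtype=floatX), name='W')
b = theano.shared(np.zeros(2, dtype=floatX), name='b')

p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)
cost = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
grads = T.grad(cost, [W, b])
lr = np.asarray(0.1, dtype=floatX)
updates = [(p, p - lr * g) for p, g in zip([W, b], grads)]

# x and y are inputs rather than givens, so only the current minibatch is copied to the GPU
train_fn = theano.function(inputs=[x, y], outputs=cost, updates=updates)

batch_size = 128
for start in range(0, len(train_x), batch_size):
    train_fn(train_x[start:start + batch_size],
             train_y[start:start + batch_size])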
Anyone following CUDA will probably have seen a few of my queries regarding a project I'm involved in, but for those who haven't I'll summarize. (Sorry for the long question in advance)
Three kernels: one generates a data set based on some input variables (it deals with bit combinations, so it can grow exponentially), another solves the generated linear systems, and a final reduction kernel extracts the result. These three kernels are run over and over again as part of an optimisation algorithm for a particular system.
On my dev machine (GeForce 9800GT, running under CUDA 4.0) this works perfectly, all the time, no matter what I throw at it (up to a computational limit imposed by the exponential growth mentioned above). But on a test machine (4x Tesla S1070, only one used, under CUDA 3.1), the exact same code (Python base, PyCUDA interface to the CUDA kernels) produces correct results for 'small' cases, while in mid-range cases the solving stage fails on random iterations.
Previous problems I've had with this code were down to the numerical instability of the problem and were deterministic in nature (i.e. they failed at exactly the same stage every time), but this one is frankly pissing me off, as it fails whenever it wants to.
As such, I don't have a reliable way of breaking the CUDA code out of the Python framework and doing proper debugging, and PyCUDA's debugger support is questionable to say the least.
I've checked the usual things, like checking free memory on the device before each kernel invocation, and occupancy calculations say the grid and block allocations are fine. I'm not doing any crazy 4.0-specific stuff, I'm freeing everything I allocate on the device at each iteration, and I've fixed all the data types as floats.
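For reference, the per-iteration free-memory check is essentially just this (via PyCUDA's driver API):

import pycuda.autoinit  # noqa: F401  (creates a context on the default device)
import pycuda.driver as cuda

free_bytes, total_bytes = cuda.mem_get_info()
print("GPU memory: %.1f MiB free of %.1f MiB total"
      % (free_bytes / 2.0 ** 20, total_bytes / 2.0 ** 20))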
TL;DR: has anyone come across any gotchas in CUDA 3.1 that I haven't seen in the release notes, or any issues with PyCUDA's autoinit memory-management environment that would cause intermittent launch failures on repeated invocations?
Have you tried:
cuda-memcheck python yourapp.py
You likely have an out-of-bounds memory access.
You can use the NVIDIA CUDA profiler to see what gets executed before the failure.