Here is the link to the official docs:
https://www.tensorflow.org/versions/r1.3/api_docs/python/tf/colocate_with
It's a context manager that makes sure the operation or tensor you're about to create is placed on the same device as the reference operation. Consider this piece of code (tested):
import tensorflow as tf

with tf.device("/cpu:0"):
    a = tf.constant(0.0, name="a")
    with tf.device("/gpu:0"):
        b = tf.constant(0.0, name="b")
        with tf.colocate_with(a):
            c = tf.constant(0.0, name="c")
        d = tf.constant(0.0, name="d")

for operation in tf.get_default_graph().get_operations():
    print(operation.name, operation.device)
Outputs:
(u'a', u'/device:CPU:0')
(u'b', u'/device:GPU:0')
(u'c', u'/device:CPU:0')
(u'd', u'/device:GPU:0')
So it places tensor c on the same device as a, regardless of the GPU device context that is active when c is created. This can be very important for multi-GPU training. Imagine that you're not careful and you end up with a graph of interdependent tensors scattered randomly across 8 devices: a complete disaster efficiency-wise. tf.colocate_with() can make sure this doesn't happen.
It is not explained in the docs because it's meant to be used by internal libraries only, so there is no guarantee it will stay. (Very likely it will, however. If you want to know more, you can look it up in the source code as of May 2018; it might move as the code changes.)
You're not likely to need this unless you're working on some low-level stuff. Most people use only one GPU, and even if you use multiple, you're generally building your graph one GPU at a time, that is within one tf.device() context manager at a time.
One example of where it's used is the tf.train.ExponentialMovingAverage class: it clearly looks like a good idea to colocate the decay and moving-average variables with the value tensor they are tracking.
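As a rough illustration (this is not the actual ExponentialMovingAverage implementation; the helper name and variable setup below are hypothetical), a library might colocate a shadow variable with the tensor it tracks like this:

import tensorflow as tf

def make_shadow_variable(value):
    # Hypothetical helper: place the tracking variable on the same device
    # as the tensor it shadows, regardless of the active device context.
    with tf.colocate_with(value):
        return tf.Variable(tf.zeros_like(value), trainable=False,
                           name=value.op.name + "/shadow")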
I read the documentation of both CentralStorageStrategy and MirroredStrategy, but I cannot understand the essential difference between them.
In MirroredStrategy:
Each variable in the model is mirrored across all the replicas.
In CentralStorageStrategy:
Variables are not mirrored, instead they are placed on the CPU and operations are replicated across all local GPUs.
Source: https://www.tensorflow.org/guide/distributed_training
What does it mean in practice? What are use cases for the CentralStorageStrategy and how does the training work if variables are placed on the CPU in this strategy?
Consider one particular variable (call it "my_var") in your usual, single-GPU, non-distributed use case (e.g. a weight matrix of a convolutional layer).
If you use 4 GPUs, MirroredStrategy will create 4 copies of "my_var", one on each GPU. However, each copy will always hold the same value, because they are all updated in the same way: the variable updates happen in sync on all the GPUs.
In the case of CentralStorageStrategy, only one copy of "my_var" is created, in host (CPU) memory, and updates happen in that one place only.
Which one is better probably depends on the machine's topology and on how fast CPU-GPU communication is compared with GPU-GPU. If the GPUs can communicate with each other quickly, MirroredStrategy may be more efficient, but I'd benchmark it to be sure.
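For reference, the two strategies are used the same way in the TF 2.x API and only differ in where variables live; a minimal sketch (the model and layer sizes are arbitrary):

import tensorflow as tf

# Variables created under this scope are replicated on every local GPU.
mirrored = tf.distribute.MirroredStrategy()
# Variables created under this scope live once, in host (CPU) memory,
# while the compute is still replicated across the local GPUs.
central = tf.distribute.experimental.CentralStorageStrategy()

with mirrored.scope():          # or central.scope()
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="sgd", loss="mse")
# model.fit(...) then runs the replicated training loop; only the variable
# placement differs between the two strategies.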
As I understand it, tf.reset_default_graph() only creates a new graph and sets it as the default graph, so the previously created tensors would just be lying around occupying memory. I have also read that unreferenced tensors are not garbage collected (unlike normal Python variables).
If I am running a cross-validation to search for a set of hyperparameters and thus creating the same graph, again and again, how do I get rid of the previously created tensors?
I had the same problem when designing experiments; after researching it, the only solution that worked for me is this one. As you can read in that link, it seems to be a design flaw, and the TF team doesn't seem interested in fixing it.
The solution is to create a new process for each cross-validation iteration. When the process finishes, the operating system reclaims all of its resources automatically.
import multiprocessing

def evaluate(...):
    import tensorflow as tf
    # Your logic

for ... in cross_validation_loop:
    process_eval = multiprocessing.Process(target=evaluate, args=(...))
    process_eval.start()
    process_eval.join()
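To make the pattern more concrete, here is a sketch with hypothetical hyperparameters (the learning rates, fold count, and the body of evaluate are all placeholders):

import multiprocessing

def evaluate(learning_rate, fold_index):
    # Import TF inside the child process so all graph/GPU state lives and
    # dies with that process.
    import tensorflow as tf
    # ... build the graph, train on the given fold, write results to disk ...

if __name__ == "__main__":
    for lr in [1e-2, 1e-3, 1e-4]:
        for fold in range(5):
            p = multiprocessing.Process(target=evaluate, args=(lr, fold))
            p.start()
            p.join()  # memory and GPU resources are released when the child exits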
Is there a way to make a Tensor iterable without running eval() to get its numpy array?
I am trying to iterate through two parts of a tensor after using split() on it, but it happens within the construction of the hidden layers of my neural network, so it needs to happen before I am able to start a session.
import tensorflow as tf

x = tf.placeholder('float', [None, nbits])
layer = [x]
for i in range(1, numbits):
    layer.append(tf.add(tf.matmul(weights[i-1], layer[i-1]), biases[i-1]))
    aes, bes = tf.split(1, 2, layer[-1])
    if i % 2 == 1:
        for am, a, b in zip(add_layer, aes, bes):
            layer.append(am.ex(a, b))
The problem is that layer[-1] is a tf.placeholder at this point, so aes and bes are both tensors, and I can't iterate through them with zip().
Any ideas would be appreciated.
No, there isn't; not directly.
It's easiest to think of TensorFlow programs as being split into two phases: a building phase, which runs Python code to construct a computation graph, and an execution phase, which runs that graph. Nothing actually runs during the building phase; all computation happens during the execution phase. The building phase can't depend on the results of the execution phase, except by running the graph (session.run(), Tensor.eval(), etc.).
You can't iterate over a Tensor while building the graph, because it doesn't get evaluated to a specific set of values until you call session.run(); it's just a reference to a node in the computation graph.
In general, you have to use TensorFlow functions to manipulate Tensors, not Python primitives (like zip). One way I like to think of it is that a Tensor is almost like a radioactive object in a sealed box, and you can only handle it indirectly, using a robot that can perform a certain set of actions (TensorFlow library functions) :-) So you likely need to find a way to express your task using TensorFlow primitives.
If you gave a complete example of what you're trying to do, it might be possible to say more (it's not clear to me from your code fragment). One possibility might be to use tf.split to split the tensors up into Python lists of subtensors, and then use something like zip on the lists.
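For example, here is a sketch using the newer tf.split signature, tf.split(value, num_or_size_splits, axis), rather than the older one in the question:

import tensorflow as tf

# tf.split asked for N pieces returns a plain Python list of N tensors,
# and a Python list can be iterated with zip at graph-construction time.
x = tf.placeholder(tf.float32, [None, 8])
halves = tf.split(x, num_or_size_splits=2, axis=1)   # [Tensor, Tensor]
pairs = zip(halves, reversed(halves))                # zipping lists, not Tensors
sums = [a + b for a, b in pairs]                     # still just building the graph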
I hope that helps!
I have implemented a TensorFlow DNN model (2 hidden layers with elu activation functions trained on MNIST) as a Python class in order to wrap TF calls within another library with its own optimization routines and tools.
When running some tests on a Tesla K20 I noticed that the GPU was being used at 4% of its total capacity. So I looked more closely at the device placement log and found that all the critical operations, like MatMul, Sum, Add and Mean, were being assigned to the CPU.
The first thing that came to mind was that it was because I was using dtype=float64, so I switched to dtype=float32. While many more operations were then assigned to the GPU, a good number were still assigned to the CPU, such as Mean, gradient/Mean_grad/Prod, and gradient/Mean.
So here comes my first question (I'm linking a working code example at the end),
1) why would that be? I have written different TF models that consist of simple tensor multiplications and reductions and they run fully on GPU as long as I use single precision.
So here comes the second question,
2) why does TF assign the graph to different devices depending on the data type? I understand that not all kernels are implemented for GPU but I would have thought that things like MatMul could run on GPU for both single and double precision.
3) Could the fact that the model is wrapped within a Python class have an effect? I do not think this is the case because as I said, it did not happen for other models wrapped similarly but that were simpler.
4) What sort of steps can I take to run the model fully on a GPU?
Here is a full working example of my code that I have isolated from my library:
https://gist.github.com/smcantab/8ecb679150a327738102
If you run it and look at the output, you'll see how different parts of the graph have been assigned to different devices. To see how this changes with types and devices, change dtype and device within main() at the end of the example. Note that if I set allow_soft_placement=False, the graph fails to initialize.
Any word of advice would be really appreciated.
As Yaroslav noted, Mean in particular was not yet implemented for the GPU at the time, but it is now available, so these operations should run on the GPU with the latest TensorFlow (as per the DEVICE_GPU registration at that link).
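A quick way to check placement on your own install is to build a tiny graph with placement logging enabled; a minimal TF 1.x sketch:

import tensorflow as tf

# Log every op's placement and allow a CPU fallback when a kernel is
# missing for the requested device/dtype.
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.device("/gpu:0"):
    x = tf.random_normal([1024, 1024], dtype=tf.float32)
    y = tf.reduce_mean(tf.matmul(x, x))
with tf.Session(config=config) as sess:
    sess.run(y)   # placement decisions are printed when the session starts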
Prior to the availability of Mean on the GPU, the status was:
(a) You can implement mean by hand, because reduce_sum is available on the GPU (see the sketch after this list).
(b) I've re-pinged someone to see if there's an easy way to add the GPU support, but we'll see.
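A minimal sketch of option (a), emulating reduce_mean with ops that had GPU kernels at the time (the tensor shape here is arbitrary):

import tensorflow as tf

x = tf.random_normal([1024, 1024], dtype=tf.float32)
# reduce_sum has a GPU kernel; divide by the element count to get the mean.
n = tf.cast(tf.size(x), tf.float32)
manual_mean = tf.reduce_sum(x) / n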
Re float64 on GPU, someone opened an issue three days ago with a patch for supporting float64 reductions on GPU. Currently being reviewed and tested.
No, it doesn't matter that it's wrapped in a Python class; it's really just about whether a kernel has been defined for the op to execute on the GPU or not. In many cases, the answer to "why is X supported on the GPU but Y not?" comes down to whether there has been demand for Y to run on the GPU. The answer for float64 is simpler: float32 is a lot faster, so in most cases people work to make their models run in float32 whenever possible, because it gives all-around speed benefits.
Most graphics cards, like the GTX 980 or 1080, are stripped of the double-precision floating-point hardware units. Since these cards are much cheaper, and therefore more ubiquitous, than the newer Tesla units (which have FP64 double-precision hardware), double-precision calculations on them are very slow compared to single precision: FP64 on a GPU without FP64 hardware seems to be about 32x slower than FP32. I believe this is why FP32 calculations tend to be placed on the GPU while FP64 goes to the CPU (which is faster for FP64 on most such systems). Hopefully, in the future, frameworks will test the GPU's capabilities at runtime to decide where to assign FP64 calculations.
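If you want to measure the gap on your own hardware, here is a rough micro-benchmark sketch (TF 1.x API; the matrix size and repetition count are arbitrary):

import time
import tensorflow as tf

def time_matmul(dtype, size=2048, reps=10):
    # Build a fresh graph for each dtype so the runs don't interfere.
    tf.reset_default_graph()
    x = tf.random_normal([size, size], dtype=dtype)
    y = tf.matmul(x, x)
    with tf.Session() as sess:
        sess.run(y)                      # warm-up
        start = time.time()
        for _ in range(reps):
            sess.run(y)
        return (time.time() - start) / reps

print("float32:", time_matmul(tf.float32))
print("float64:", time_matmul(tf.float64))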
I'm using theano to do some computation involving a large (about 300,000 x 128) matrix.
The theano function quit after outputting a MemoryError, similar to this question.
I think it's probably because the large matrix gets processed step by step, and each step leaves a large GpuArray (with the same shape as the first one) in memory.
So my question is:
1) In the case of temporary (non-output) variables with the same shape and dtype, does Theano make any effort to reuse the allocated memory? For example, does a pool exist for each array shape?
2) If 1) holds, can I inspect which nodes in the function graph reuse memory?
I know that shared variables allow explicit sharing, but I worry that using them would make the (already hard to follow) computation code even harder to understand.
UPDATE
A simplified example of such a situation:
import theano
from theano import tensor as T
a0 = T.matrix() # the initial
# op1, op2, op3 are valid, complicated operations
# whose outputs have the same shape as a0
a1 = op1(a0)
a2 = op2(a1)
a3 = op3(a2)
f = theano.function([a0], a3)
If none of op1, op2, op3 can be optimized in-place, will Theano try to reuse memory anyway? For example, might a1 and a3 "share" the same address to reduce the memory footprint, since a1 is no longer needed by the time op3 runs?
Thanks!
You need to use in-place operations as much as possible. For the most part this is not under user control (they are automatically used by the optimizer when circumstances allow) but there are a few things you can do to encourage their use.
Take a look at the documentation on this issue:
memory profiler
Information on the optimizations that enable in-place operation
How to create custom operations that can work in-place
Do not use the inplace parameter of the inc_subtensor operation, as indicated in the documentation. The docs also indicate that inplace operators are not supported for user specification (the optimizer will automatically apply them when possible).
You could use Theano's memory profiler to help track down which operation(s) are using up the memory.
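As a sketch of how one might enable the profiler (flag names are from Theano's config system; the toy expression below just stands in for op1/op2/op3):

import numpy as np
import theano
from theano import tensor as T

# Enable per-node timing and memory statistics; these can also be set
# globally with THEANO_FLAGS=profile=True,profile_memory=True.
theano.config.profile = True
theano.config.profile_memory = True

a0 = T.matrix()
a3 = ((a0 * 2) + 1) ** 2                 # stand-in for op1/op2/op3
f = theano.function([a0], a3, profile=True)
f(np.ones((1000, 128), dtype=theano.config.floatX))
f.profile.summary()                      # prints the per-node report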
UPDATE
I'm not an expert in this area but I believe Theano uses a garbage collection mechanism to free up memory that is no longer needed. This is discussed in the documentation.
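The relevant knob, as far as I know, is the allow_gc flag (a sketch; it can also be set via THEANO_FLAGS=allow_gc=False):

import theano

# With allow_gc=True (the default), intermediate buffers are freed as soon
# as they are no longer needed within a function call; with allow_gc=False
# they are kept and reused across calls, trading memory for speed.
theano.config.allow_gc = True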