Is it possible to reuse tensors in multiple tf-graphs, even after they are reset?
Problem:
I have a large dataset that I want to evaluate with many different tf-graphs.
For each evaluation, tensorflow is reset with tf.compat.v1.reset_default_graph() and initialized completely from scratch.
Imho, it seems kind of dull and slow to call the data-to-tensor procedure every time, so I thought I could just define the data tensor once and use it for all future evaluations.
Unfortunately, reusing tensors does not seem to be possible, as TensorFlow complains that the 'Tensor must be from the same graph as Tensor':
ValueError: Tensor("Const:0", shape=(1670,), dtype=float32, device=/device:GPU:0) must be from the same graph as Tensor("Const_1:0", shape=(1670,), dtype=float32).
Is it possible to reuse these tensors somehow?
Check out this answer to another question: https://stackoverflow.com/a/42616834/13514201
TensorFlow stores all operations on an operational graph. This graph defines what functions output to where, and it links it all together so that it can follow the steps you have set up in the graph to produce your final output. If you try to input a Tensor or operation on one graph into a Tensor or operation on another graph it will fail. Everything must be on the same execution graph.
Try removing with tf.Graph().as_default():
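Building on that, one way to avoid resetting at all is to keep a single default graph and give each evaluation its own variable scope, so the data tensor is built once and stays valid. A minimal sketch, assuming TF1-style graph mode and hypothetical load_data(), build_model(), and num_models:

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Build the expensive data tensor once, in the single default graph.
data = tf.constant(load_data(), dtype=tf.float32)  # load_data() is hypothetical

for i in range(num_models):  # num_models is hypothetical
    # Each evaluation gets its own variable scope instead of a fresh graph,
    # so `data` stays valid and tf.reset_default_graph() is never called.
    with tf.variable_scope("model_%d" % i):
        output = build_model(data)  # build_model() is hypothetical
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        result = sess.run(output)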
Related
I am an intermediate learner in PyTorch, and in some recent cases I have seen people use torch.inference_mode() instead of the famous torch.no_grad() while validating a trained agent in reinforcement learning (RL) experiments. I checked the documentation, and it has a table describing the two flags for disabling gradient computation. To be honest, the descriptions read exactly the same to me. Has someone figured out an explanation?
So I have been scraping the web for a few days and I think I have my explanation. torch.inference_mode() has been added as an even more optimized way of doing inference with PyTorch (versus torch.no_grad()). I listened to the PyTorch podcast, and they have an explanation as to why there exist two different flags.
Version counting of tensors: Let's say you have some PyTorch code that you used to train an agent. When you use torch.no_grad() and just run inference on the trained model, some PyTorch machinery, like version counting of tensors, is still in play: a version counter is allocated every time a tensor is created and is incremented (version bump) when you mutate that specific tensor. Keeping track of the versions of all tensors has an extra computational cost, and it can't just be dropped, because autograd has to watch out for tensor mutations, either (directly) to that specific tensor or (indirectly) through an alias of some other tensor that is saved for the backward computation.
View tracking of tensors: PyTorch tensors are strided. That means PyTorch uses strides in the backend for indexing, which can be used to directly access specific elements in the memory block. But what about torch.autograd: what if you take a tensor, create a new view of it, and mutate it with a tensor that is associated with the backward computation? With torch.no_grad(), PyTorch still keeps a record of view metadata, which is required to track which tensors require gradients and which do not. This also adds extra overhead to your computational resources.
torch.autograd checks for these changes, and neither is tracked when you switch to torch.inference_mode() (instead of torch.no_grad()). So if your code does not rely on the above two mechanisms, inference mode works and reduces execution time. (The PyTorch dev team says they have seen a speedup of 5-10% while deploying models in production at Facebook.)
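As a quick illustration of the practical difference, a minimal toy sketch (my own example, not from the podcast): both contexts disable gradient recording, but a tensor created under inference_mode is additionally barred from ever participating in autograd later.

import torch

model = torch.nn.Linear(10, 2)  # stand-in for a trained model
x = torch.randn(1, 10)

with torch.no_grad():
    y_ng = model(x)  # no graph recorded, but version counting / view tracking still run

with torch.inference_mode():
    y_im = model(x)  # skips that bookkeeping too; y_im is marked as an inference tensor

print(y_ng.requires_grad, y_im.requires_grad)  # False False
# Using y_im later inside an autograd-recorded computation raises a RuntimeError,
# whereas y_ng can still be used there.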
Keras layers can be reused i.e. if I have l = keras.layers.Dense(5) I can apply it multiple times to different tensors like t1 = l(t1); t2 = l(t2).
Is there anything similar in tensorflow without using keras?
Why do I need it: I am in non-eager mode and want to create a static .pb graph file. Suppose I have a function f(t) that is huge and long, transforming a tensor t. Inside a graph it creates a huge sub-graph of different operations with tensors flowing along its paths. Now I want to reuse it, meaning I don't want to call it for every input t, because each call forms a new sub-graph that is just a duplicate with different inputs. I want to somehow reuse the same sub-graph, directing different tensors into it as inputs. It is also good to reuse it so that the huge function is not called to build the same structure for every possible input tensor, because that is slow.
Another important reason for reusing the same operation is that the same weights and heavy parameters can then be used for many calls of the operation on many inputs. It is sometimes important, and even required, that the weights be the same for all inputs in order to have a correctly trained neural network.
The real reason for reusing is not only to save the space occupied by the graph, but also that the number of calls to f(t) may vary depending on the input. Suppose we have a keras.layers.Input(...) placeholder as input. Its batch (0-th) dimension is always None (unknown) at graph-construction time; the real value of the 0-th dimension is only known when real data is fed through sess.run(...). When data is fed, I want to make as many transformations (calls to f(t)) as the batch dimension's size; in other words, I want to call f(t) for every sub-tensor in the batch. E.g., for a batch of images I want to call f(t) for every single image in the batch. Hence there will be a different number of calls to f(t) for different batch sizes. How do I achieve this? Could it be achieved through tf.while_loop, and if so, how do I use a while loop in my case?
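There is no accepted answer here, but for illustration, a minimal sketch of the two mechanisms usually suggested, assuming TF1 graph mode: tf.make_template shares the variables (weights) across calls (each call still adds ops, but the parameters are not duplicated), and tf.map_fn, built on tf.while_loop, emits a single loop body in the graph and iterates it at run time for whatever batch size is fed.

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

def f(t):
    # stand-in for the huge transformation; variables created via get_variable
    w = tf.get_variable("w", shape=[128, 64])
    return tf.nn.relu(tf.matmul(t, w))

# make_template creates the variables on the first call and reuses them after
shared_f = tf.make_template("shared_f", f)

t1 = tf.placeholder(tf.float32, [None, 128])
t2 = tf.placeholder(tf.float32, [None, 128])
y1 = shared_f(t1)  # creates w
y2 = shared_f(t2)  # reuses the same w: same weights for both inputs

# For a per-example call with an unknown batch size, map_fn loops one
# sub-graph over the batch at run time instead of duplicating it:
per_example = tf.map_fn(lambda e: shared_f(e[tf.newaxis, :])[0], t1)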
I'm a newbie with PyTorch and adversarial networks. I've tried to look for an answer on the PyTorch documentation and from previous discussions both in the PyTorch and StackOverflow forums, but I couldn't find anything useful.
I'm trying to train a GAN with a Generator and a Discriminator, but I cannot understand if the whole process is working or not. As far as I understand, I should train the Generator first and then update the Discriminator's weights (similar to this). My code for updating the weights of both models is:
# computing loss_g and loss_d...

# generator update
optim_g.zero_grad()
loss_g.backward()
optim_g.step()

# discriminator update
optim_d.zero_grad()
loss_d.backward()
optim_d.step()
where loss_g is the generator loss, loss_d is the discriminator loss, optim_g is the optimizer referring to the generator's parameters and optim_d is the discriminator optimizer.
If I run the code like this, I get an error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
So I specify loss_g.backward(retain_graph=True), and here comes my doubt: why should I specify retain_graph=True if there are two networks with two different graphs? Am I getting something wrong?
Having two different networks doesn't necessarily mean that the computational graph is different. The computational graph only tracks the operations that were performed from the input to the output and it doesn't matter where the operation takes place. In other words, if you use the output of the first model in the second model (e.g. model2(model1(input))), you have the same sequential operations as if they were part of the same model. In fact, that is no different from having different parts of the model, such as multiple convolutions, that you apply one after the other.
The error you get indicates that you are trying to backpropagate from the discriminator through the generator, which would mean that the discriminator's output directly adapts the generator's parameters for the discriminator to be successful. In an adversarial setting that is precisely what you want to avoid; they should be independent from each other. By setting retain_graph=True you incorrectly hide this bug. In nearly all cases retain_graph=True is not the solution and should be avoided.
To resolve the issue, the two models need to be made independent from each other. The crossover between the two models happens when you use the generator's output for the discriminator, since it should decide whether that was real or fake. Something along these lines:
fake = generator(noise)
real_prediction = discriminator(real)
# Using the output of the generator, continues the graph.
fake_prediction = discriminator(fake)
Even though fake comes from the generator, as far as the discriminator is concerned, it's merely another input, just like real. Therefore fake should be treated the same as real, where it is not attached to any computational graph. That can easily be done with torch.Tensor.detach, which decouples the tensor from the graph.
fake = generator(noise)
real_prediction = discriminator(real)
# Detach to make it independent of the generator
fake_prediction = discriminator(fake.detach())
That is also done in the code you referenced, from erikqu/EnhanceNet-PyTorch - train.py:
hr_imgs = torch.cat([discriminator(hr), discriminator(generated_hr.detach())], dim=0)
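Putting it all together, a minimal sketch of one training step with that fix, assuming generator, discriminator, optim_g, optim_d, real, noise, and a criterion such as nn.BCEWithLogitsLoss() are already defined, and that the discriminator outputs one logit per sample:

import torch

ones = torch.ones(real.size(0), 1)    # target labels for "real"
zeros = torch.zeros(real.size(0), 1)  # target labels for "fake"

fake = generator(noise)

# Discriminator step: detach, so no gradients flow into the generator.
optim_d.zero_grad()
loss_d = criterion(discriminator(real), ones) \
       + criterion(discriminator(fake.detach()), zeros)
loss_d.backward()
optim_d.step()

# Generator step: a fresh pass through the discriminator, without detach,
# so gradients do flow back into the generator. No retain_graph needed.
optim_g.zero_grad()
loss_g = criterion(discriminator(fake), ones)
loss_g.backward()
optim_g.step()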
Context: I have two Tensorflow graphs that ideally should produce the same result for image segmentation. (The model comes from here). The first graph is the "original" graph, while the second graph is a simplified version after running the toco tool on the first graph and setting the input to a fixed size (in this case, 1,572,572,1).
The command I used was bazel run //tensorflow/lite/toco:toco -- --drop_control_dependency --input_file=$MODEL_DIR/unet.pb --output_file=$MODEL_DIR/unet_bn.pb --input_format=TENSORFLOW_GRAPHDEF --output_format=TENSORFLOW_GRAPHDEF --input_shape=1,572,572,1 --input_array=x --output_array=output_map/Relu
Unfortunately, toco does not yet seem to support the Exponential operator, so I have run both graphs with the same input up to the same point partway through the graph. The two graphs do not produce the same results. The difference starts after the first convolution.
I noticed that the original model has a Conv2D operation whereas the simplified model uses DepthwiseConv2dNative.
Upon examining the Tensorflow toco source code, it appears that one of the graph transforms it performs is converting "pure" convolutions to depthwise convolutions as seen here, and one of the conditions for doing so is if the input shape has 1 feature channel (i.e. input_array.shape().dims(3) == 1), which is indeed the case for the first convolution in the model, but not for subsequent convolutions.
So I see why this conversion is taking place, but after this conversion, the two graphs do not produce the same result! Is there an explanation as to why this is happening?
Upon further investigation, it appears that the simplified results are "mostly correct".
By that, I mean that directly comparing both graphs' output tensors using == evaluates to false, because the two tensors are not identically equal.
However, if I permit a tolerance level, in this case 1e-04, as such:
print(np.allclose(reference_results, simplified_results, rtol=1e-04)), it evaluates to True.
So the error could perhaps be due to floating point accumulation or some other small discrepancy, but I believe it does not suggest some larger systematic error.
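For what it's worth, the equivalence itself is easy to check numerically: with one input channel, conv2d's [kh, kw, in, out] filter layout coincides with depthwise's [kh, kw, in, multiplier], so the same filter array can be fed to both ops, and the results agree up to float tolerance, just like the comparison above. A small sketch (TF2 eager mode, not the toco-converted graphs themselves):

import numpy as np
import tensorflow as tf

x = tf.random.normal([1, 32, 32, 1])  # single input channel, as in the model
f = tf.random.normal([3, 3, 1, 8])    # one filter array, valid for both layouts

conv = tf.nn.conv2d(x, f, strides=1, padding="SAME")
dw = tf.nn.depthwise_conv2d(x, f, strides=[1, 1, 1, 1], padding="SAME")

# Same math, different kernels: equal only up to floating-point rounding.
print(np.allclose(conv.numpy(), dw.numpy(), rtol=1e-4))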
Suppose I have two placeholder quantities in tensorflow: placeholder_1 and placeholder_2. Essentially I would like the following computational functionality: "if placeholder_1 is defined (ie is given a value in the feed_dict of sess.run()), compute X as f(placeholder_1), otherwise, compute X as g(placeholder_2)." Think of X as being a hidden layer in a neural network that can optionally be computed in these two different ways. Eventually I would use X to produce an output, and I'd like to backpropagate error to the parameters of f or g depending on which placeholder I used.
One could accomplish this using the tf.where(condition, x, y) function if there was a way to make the condition "placeholder_1 has a value", but after looking through the tensorflow documentation on booleans and asserts I couldn't find anything that looked applicable.
Any ideas? I have a vague idea of how I could accomplish this basically by copying part of the network, sharing parameters and syncing the networks after updates, but I'm hoping for a cleaner way to do it.
You can create a third placeholder variable of type boolean to select which branch to use and feed that in at run time.
The logic behind it is that since you are feeding in the placeholders at run time anyway, you can determine outside of TensorFlow which placeholders will be fed.
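A minimal sketch of that idea, assuming TF1 graph mode, using tf.cond (rather than tf.where, since the selector is a scalar) and stand-in computations for f and g; tf.placeholder_with_default means the input of the branch you don't take doesn't have to be fed:

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Stand-in parameters for f and g, created outside tf.cond as usually recommended.
w_f = tf.get_variable("w_f", [10, 5])
w_g = tf.get_variable("w_g", [10, 5])

# Defaults mean the unused placeholder need not appear in feed_dict.
placeholder_1 = tf.placeholder_with_default(tf.zeros([1, 10]), [None, 10])
placeholder_2 = tf.placeholder_with_default(tf.zeros([1, 10]), [None, 10])
use_first = tf.placeholder(tf.bool, shape=[])  # the boolean selector

X = tf.cond(use_first,
            lambda: tf.matmul(placeholder_1, w_f),  # stand-in for f(placeholder_1)
            lambda: tf.matmul(placeholder_2, w_g))  # stand-in for g(placeholder_2)

# sess.run(X, feed_dict={use_first: True, placeholder_1: my_data})
# Only the taken branch's ops run, and gradients from a loss on X flow only into it.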