As I understand it, tf.reset_default_graph() only creates a new graph and sets it as the default graph, so the previously created tensors would just be lying around occupying memory. I have also read that unreferenced tensors are not garbage collected (unlike normal variables in Python).
If I am running cross-validation to search for a set of hyperparameters, and therefore creating the same graph again and again, how do I get rid of the previously created tensors?
I had the same problem when designing experiments. After researching this problem, the only solution that worked for me is this one. As you can read at that link, it seems to be a design flaw, and the TF team doesn't seem interested in fixing it.
The solution is to create a new process for each cross-validation iteration. When the process finishes, the system kills it and releases the resources automatically.
import multiprocessing

def evaluate(...):
    import tensorflow as tf   # import TF inside the child process
    # Your logic

for ... in cross_validation_loop:
    process_eval = multiprocessing.Process(target=evaluate, args=(...))
    process_eval.start()
    process_eval.join()       # when the child exits, its memory is released by the OS
I am stuck implementing the Asynchronous Advantage Actor-Critic (A3C) algorithm using TensorFlow 2.
Problem Definition:
For the A3C implementation, I have to create a bunch of workers (as many as the number of CPU cores) and a master. Each worker, as well as the master, creates its own copy of a single CNN model. The problem arises when each worker has to optimize the master's CNN and also synchronize its weights with the master's weights. I implemented this with multithreading without problems, but when multiprocessing comes in, Python can serialize neither the weights nor the CNN itself to pass them between the workers and the master.
Other's problem:
When I was googling to cope with this problem, I noticed different opinions (and almost all the Q&As were about TF 1). Some people believe that TensorFlow doesn't support multiprocessing, so they either moved to PyTorch or just used multithreading. Other people proposed the ray library.
First of all, I want to know whether it is possible to implement a multiprocessing approach like A3C with TF 2.
If it is possible, I would appreciate it if someone could share similar work with me.
I'm running into the exact same issue myself. I did find a resource (see the link below) that uses TF 1.x and multiprocessing for A3C. In general, they use Queues to share the model weights.
Personally, I'm curious whether there's an easier or better way to use multiprocessing for A3C. I found it quite hard to replicate their approach, so if you find another method, please share!
https://github.com/hongzimao/a3c/blob/master/train.py
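To give a feel for the queue-based idea, here is my own minimal sketch (not their code): the weights travel between processes as plain numpy arrays, which pickle fine, instead of trying to pickle the model itself. The tiny Dense model and the zero "gradients" are just placeholders.

import multiprocessing as mp
import numpy as np

def worker(param_queue, grad_queue):
    import tensorflow as tf                      # import TF inside the child process
    model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])
    weights = param_queue.get()                  # receive master's weights (list of numpy arrays)
    model.set_weights(weights)
    # ... compute gradients locally, then send them back as numpy arrays ...
    grads = [np.zeros_like(w) for w in weights]  # placeholder gradients
    grad_queue.put(grads)

if __name__ == '__main__':
    import tensorflow as tf
    master = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])
    param_q, grad_q = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(param_q, grad_q))
    p.start()
    param_q.put(master.get_weights())            # ship weights to the worker as numpy arrays
    grads = grad_q.get()                         # receive gradients; apply them to the master here
    p.join()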
I have some trouble with how TensorFlow handles memory. I would like to remove tensors from memory after each iteration of this toy example.
I am using TensorFlow eager execution. I have tried with Variables and with plain tensors; tf.assign doesn't do the job, and more and more memory is used. I guess that's normal in order to be able to compute the gradient, but even if I apply some dummy optimizer at the end of each iteration, the memory isn't released (more precisely, it sometimes is, but the global trend is that memory use keeps growing).
So is it possible to delete tensors manually?
import tensorflow as tf
import tensorflow.contrib.eager as tfe
import numpy as np
import time as ti

tf.enable_eager_execution()

for i in range(150):
    all_subject = tfe.Variable(np.random.rand(200, 500), dtype=tf.float32)
    tf.assign(all_subject, np.random.rand(200, 500))
    ti.sleep(1.0)
    del all_subject
    ti.sleep(0.5)
What the allocation looks like:
Memory profile
According to the documentation on eager execution,
During eager execution the lifetime of state objects is determined by the lifetime of their corresponding Python object.
So you should not observe any memory leak in your code, even without an explicit del: simply reassigning your variable to something else should free the memory.
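For example, under that rule a loop like the following should stay flat in memory, since rebinding the name each iteration drops the only reference to the previous variable (a minimal sketch of what the documentation implies, reusing your shapes):

import numpy as np
import tensorflow as tf
import tensorflow.contrib.eager as tfe

tf.enable_eager_execution()

for i in range(150):
    # rebinding `all_subject` should let the previous 200x500 tensor be freed
    all_subject = tfe.Variable(np.random.rand(200, 500), dtype=tf.float32)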
However, this is not what happens, and I observe the same memory leak as you.
So this could be a (serious) bug, which you could submit here.
I'm using TensorFlow 0.10 and I was benchmarking the examples found in the official HowTo on reading data. The HowTo illustrates different methods for moving data into TensorFlow, using the same MNIST example.
I was surprised by the results and I was wondering if anyone has enough low-level understanding to explain what is happening.
In the HowTo there are basically 3 methods to read in data:
Feeding: build the mini-batch in Python and pass it with sess.run(..., feed_dict={x: mini_batch}) (see the minimal sketch after this list).
Reading from files: use TF operations to open the files and create mini-batches (bypassing data handling in Python).
Preloaded data: load all the data into either a single TF variable or constant and use TF functions to break it up into mini-batches. The variable or constant is pinned to the CPU, not the GPU.
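For reference, a minimal sketch of what I mean by Feeding (the model/loss here is a dummy stand-in, not the tutorial's network):

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])     # MNIST images flattened to 784 features
y_ = tf.placeholder(tf.int64, [None])           # integer class labels
loss = tf.reduce_mean(x) + tf.reduce_mean(tf.cast(y_, tf.float32))  # stand-in for a real loss

with tf.Session() as sess:
    images = np.random.rand(100, 784).astype(np.float32)   # mini-batch built in Python
    labels = np.random.randint(10, size=100)
    sess.run(loss, feed_dict={x: images, y_: labels})       # data crosses into TF here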
The scripts I used to run my benchmarks are found within tensorflow:
Feeding: examples/tutorials/mnist/fully_connected_feed.py
Reading from files: examples/how_tos/reading_data/convert_to_records.py and examples/how_tos/reading_data/fully_connected_reader.py
Preloaded data (constant): examples/how_tos/reading_data/fully_connected_preloaded.py
Preloaded data (variable): examples/how_tos/reading_data/fully_connected_preloaded_var.py
I ran those scripts unmodified, except for the last two, which crash (for version 0.10 at least) unless I add an extra sess.run(tf.initialize_local_variables()).
Main Question
The time to execute 100 mini-batches of 100 examples each, running on a GTX 1060:
Feeding: ~0.001 s
Reading from files: ~0.010 s
Preloaded data (constant): ~0.010 s
Preloaded data (variable): ~0.010 s
Those results are quite surprising to me. I would have expected Feeding to be the slowest, since it does almost everything in Python, while the other methods use lower-level TensorFlow/C++ to carry out similar operations. Instead it is the complete opposite. Does anyone understand what is going on?
Secondary question
I have access to another machine with a Titan X and older NVIDIA drivers. The relative results were roughly in line with the above, except for Preloaded data (constant), which was catastrophically slow, taking many seconds for a single mini-batch.
Is it a known issue that performance can vary greatly with hardware/drivers?
Update Oct 9: the slowness comes from the computation running too fast for Python to pre-empt the computation thread and schedule the pre-fetching threads. Computation in the main thread takes 2 ms, and apparently that's too little for a pre-fetching thread to grab the GIL. The pre-fetching threads have a larger delay and hence can always be pre-empted by the computation thread. So the computation thread runs through all of the examples and then spends most of its time blocked on the GIL while a prefetching thread gets scheduled and enqueues a single example. The solution is to increase the number of Python threads, increase the queue size to fit the entire dataset, start the queue runners, and then pause the main thread for a couple of seconds to give the queue runners time to pre-populate the queue.
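A rough sketch of that workaround (the input tensor, batch size, and thread/queue sizes here are placeholders, not the tutorial's exact code):

import time
import tensorflow as tf

single_example = tf.random_uniform([784])        # stand-in for whatever your reader produces
examples = tf.train.shuffle_batch(
    [single_example],
    batch_size=100,
    num_threads=8,                               # more Python prefetching threads
    capacity=55000,                              # big enough to hold the entire training set
    min_after_dequeue=10000)

sess = tf.Session()
coord = tf.train.Coordinator()
tf.train.start_queue_runners(sess=sess, coord=coord)
time.sleep(10)                                   # let the queue runners pre-populate the queue
# ... run the timed training loop here ...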
Old stuff
That's surprisingly slow.
This looks like some kind of special case making the last 3 examples unnecessarily slow (most effort went into optimizing large models like ImageNet, so MNIST didn't get as much attention).
You can diagnose the problem by collecting timelines, as described here.
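For reference, a minimal sketch of collecting such a timeline in TF 1.x, assuming you already have a sess and a train_op from your training script:

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)   # the traced step

tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())   # load this file in chrome://tracing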
Here are 3 of those examples with timeline collection enabled.
Here's the timeline for feed_dict implementation
The important thing to notice is that matmul takes a good chunk of the time, so the reading overhead is not significant.
Now here's the timeline for reader implementation
You can see that the operation is bottlenecked on QueueDequeueMany, which takes a whopping 45 ms.
If you zoom in, you'll see a bunch of tiny MEMCPY and Cast operations, which is a sign of some op being CPU-only (parse_single_example), so the dequeue has to schedule multiple independent CPU->GPU transfers.
For the var example below with the GPU disabled, I don't see the tiny little ops, but QueueDequeueMany still takes over 10 ms. The timing seems to scale linearly with batch size, so there's some fundamental slowness there. Filed #4740
Yaroslav nails the problem well. With small models you'll need to speed up the data import. One way to do this is with the TensorFlow function tf.TFRecordReader.read_up_to, which reads multiple records in each session.run() call, thereby removing the excess overhead caused by multiple calls.
enqueue_many_size = SOME_ENQUEUE_MANY_SIZE
reader = tf.TFRecordReader(
    options=tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB))
_, queue_batch = reader.read_up_to(filename_queue, enqueue_many_size)
batch_serialized_example = tf.train.shuffle_batch(
    [queue_batch],
    batch_size=batch_size,
    num_threads=thread_number,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue,
    enqueue_many=True)  # queue_batch already holds many records, so enqueue them all at once
This was also addressed in this SO question.
The main question is why the preloaded data (constant) example, examples/how_tos/reading_data/fully_connected_preloaded.py, is significantly slower than the other data-loading examples when using the GPU.
I had the same problem: fully_connected_preloaded.py was unexpectedly slow on my Titan X. The problem was that the whole dataset was pre-loaded onto the CPU, not the GPU.
First, let me share my initial attempts. I applied the following performance tips by Yaroslav (a sketch of these settings follows the list).
set capacity=55000 for tf.train.slice_input_producer (55000 is the size of the MNIST training set in my case).
set num_threads=5 for tf.train.batch.
set capacity=500 for tf.train.batch.
put time.sleep(10) after tf.train.start_queue_runners.
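For reference, a rough sketch of how these settings map onto the pipeline (input_images and input_labels are the preloaded constants shown further below; this is not the script's exact code):

import time
import tensorflow as tf

image, label = tf.train.slice_input_producer(
    [input_images, input_labels], capacity=55000)     # tip 1: queue holds the whole training set
images, labels = tf.train.batch(
    [image, label], batch_size=100,
    num_threads=5, capacity=500)                      # tips 2 and 3

sess = tf.Session()
sess.run(tf.initialize_all_variables())
tf.train.start_queue_runners(sess=sess)
time.sleep(10)                                        # tip 4: let the queue runners fill the queues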
However, the average speed per batch stayed the same. I tried timeline visualization for profiling and still saw QueueDequeueManyV2 dominating.
The problem was line 65 of fully_connected_preloaded.py. The following code loads the entire dataset onto the CPU, which still leaves a bottleneck for CPU-to-GPU data transfer.
with tf.device('/cpu:0'):
    input_images = tf.constant(data_sets.train.images)
    input_labels = tf.constant(data_sets.train.labels)
Hence, I switched the device allocation:
with tf.device('/gpu:0'):
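    # same two constants as above, now placed on the GPU
    # (assumes the GPU has enough memory to hold the full dataset)
    input_images = tf.constant(data_sets.train.images)
    input_labels = tf.constant(data_sets.train.labels)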
Then I got a roughly 100x speed-up per batch.
Note:
This was possible because Titan X has enough memory space to preload entire dataset.
In the original code (fully_connected_preloaded.py), the comment on line 64 says "rest of pipeline is CPU-only". I am not sure what this comment intended.
Is there a way to make a Tensor iterable without running eval() to get its numpy array?
I am trying to iterate through two parts of a tensor after using split() on it, but this happens while I am constructing the hidden layers of my neural network, so it needs to work before I am able to start a session.
import tensorflow as tf

x = tf.placeholder('float', [None, nbits])
layer = [x]
for i in range(1, numbits):
    layer.append(tf.add(tf.matmul(weights[i-1], layer[i-1]), biases[i-1]))
    aes, bes = tf.split(1, 2, layer[-1])
    if i % 2 == 1:
        for am, a, b in zip(add_layer, aes, bes):
            layer.append(am.ex(a, b))
The problem is that layer[-1] is a tf.placeholder at this point, so aes and bes are both tensors, and I can't iterate through them with zip().
Any ideas would be appreciated.
No, there isn't; not directly.
It's easiest to think about TensorFlow programs as being split into two phases: a building phase, in Python, that constructs a computation graph, and an execution phase that runs that graph. Nothing actually runs during the building phase; all computation happens during the execution phase. The building phase can't depend on the results of the execution phase, except by actually running the graph (session.run(), .eval(), etc.).
You can't iterate over a Tensor while building the graph, because it doesn't get evaluated to a specific set of values until you call session.run(); until then it's just a reference to a node in the computation graph.
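As a tiny illustration of the two phases (hypothetical values, TF 1.x style):

import tensorflow as tf

a = tf.constant(3)
b = a * 2                    # building phase: adds a node to the graph, computes nothing
print(b)                     # prints a Tensor object, not 6

with tf.Session() as sess:
    print(sess.run(b))       # execution phase: the graph actually runs and prints 6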
In general, you have to use Tensorflow functions to manipulate Tensors, not Python primitives (like zip). One way I like to think of it is that it's almost like a Tensor is a radioactive object in a sealed box, and you can only handle it indirectly using a robot that can perform a certain set of actions (Tensorflow library functions) :-) So you likely need to find a way to express your task using Tensorflow primitives.
If you gave a complete example of what you're trying to do, it might be possible to say more (it's not clear to me from your code fragment). One possibility might be to use tf.split to split the tensor up into a Python list of sub-tensors, and then use something like zip on that list.
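For example (a sketch only, using the old-style tf.split(split_dim, num_split, value) signature from your snippet; in later TF versions the argument order is tf.split(value, num_splits, axis)):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 8])
parts = tf.split(1, 4, x)        # returns a plain Python list of 4 sub-tensors

# zip works here because `parts` is an ordinary Python list at graph-construction time,
# even though each element is still a symbolic Tensor
combined = [a + b for a, b in zip(parts[:2], parts[2:])]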
I hope that helps!
My use case of TensorFlow requires me to build a new computation graph for each instance that needs to be processed. This ends up blowing up the memory requirements.
Apart from a few tf.Variables that are model parameters, I'd like to delete all other nodes. Other people with similar problems have found tf.reset_default_graph() useful, but that would also get rid of the model parameters that I need to persist.
What can I use to delete all but these nodes?
Edit:
The instance-specific computation actually just means I am adding a lot of new operations. I believe these operations are the reason behind the memory issues.
UPDATE:
See the recently released tensorflow fold (https://github.com/tensorflow/fold) which allows dynamic construction of computation graphs.
The tf.Graph data structure is designed to be append-only, so it is not possible to remove or modify existing nodes. Usually this is not a problem, as only the necessary subgraph is processed when running a session.
What you can try is to copy the Variables of your graph into a new graph and delete the old one. To achieve this, just run:
old_graph = tf.get_default_graph()   # keep a handle on the old graph for later iteration
new_graph = tf.Graph()               # create an empty graph
with new_graph.as_default():         # makes the new graph the default inside this block
    ...                              # build the replacement graph here
If you want to iterate over all nodes in the old graph use:
for node in old_graph.get_operations():
    if node.type == 'Variable':
        # read the value of the variable and copy it into the new graph
Alternatively you can use:
for node in old_graph.get_collection('trainable_variables'):
    # iterates over all trainable Variables
    # read the value and create a corresponding new variable
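A rough sketch of that copy step (assuming the variables in the old graph are already initialized; the rest of the new graph's construction is up to you):

import tensorflow as tf

old_graph = tf.get_default_graph()
with tf.Session(graph=old_graph) as sess:
    # read the current values out of the old graph as numpy arrays
    values = {v.op.name: sess.run(v)
              for v in old_graph.get_collection('trainable_variables')}

new_graph = tf.Graph()
with new_graph.as_default():
    # re-create each variable in the new graph, initialized with the saved values
    new_vars = {name: tf.Variable(value, name=name)
                for name, value in values.items()}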
Also have a look at python/framework/ops.py : 1759 to see more ways of manipulating nodes in a graph.
However, before you mess around with tf.Graph, I would strongly recommend considering whether this is really required. Usually you can generalize the computation and use shared variables to build one graph, so that each instance you want to process is a subgraph of that graph.