I have the following lines as part of a program:
tensor_gradients = optimizer.compute_gradients(cross_entropy)
with tf.Session() as session:
    for step in range(20000):
        batch = mnist.train.next_batch(train_batch_size)
        feed = {input_x: batch[0], input_y: batch[1]}
        gradients = session.run([tensor_gradients], feed)[0]
        for i in range(len(gradients)):
            gradients[i] = (gradients[i][0], tensor_gradients[i][1])
        # ... computation on gradients ...
        training_step = optimizer.apply_gradients(gradients)
        training = session.run([training_step], feed)
The reason I'm doing this is that I want to modify the gradients using numpy. The above code runs out of memory around step 800. However, if I replace the argument to optimizer.apply_gradients with tensor_gradients, as below, the code does not run out of memory.
training_step = optimizer.apply_gradients(tensor_gradients)
Any ideas about what might be happening? The rest of the code remains the same except for the line above. Is it possible that the numpy arrays in gradients are not being garbage collected because they are being passed into the apply_gradients step? I have no idea where the memory leak could be, or whether I'm inadvertently adding to the TensorFlow graph by passing modified gradients (in numpy array form) back into apply_gradients.
The OOM happens because you're constructing the graph inside the loop: each iteration adds another apply_gradients subgraph, so after 20,000 steps the graph has roughly 20,000 times as many nodes, and running it may need more memory than you have.
Move all TF operations that build the graph outside the loop, i.e. everything except feed_dict construction and sess.run calls.
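A minimal sketch of that fix (assuming the same optimizer, cross_entropy, mnist, input_x, input_y and train_batch_size as in the question, and assuming every gradient is a dense tensor): build one placeholder per gradient, call apply_gradients once outside the loop, and feed the numpy-modified gradients through the feed_dict.
tensor_gradients = optimizer.compute_gradients(cross_entropy)

# One placeholder per gradient, paired with its variable.
gradient_placeholders = [tf.placeholder(tf.float32, shape=var.get_shape())
                         for grad, var in tensor_gradients]
apply_modified = optimizer.apply_gradients(
    [(ph, var) for ph, (grad, var) in zip(gradient_placeholders, tensor_gradients)])

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for step in range(20000):
        batch = mnist.train.next_batch(train_batch_size)
        feed = {input_x: batch[0], input_y: batch[1]}
        # Evaluate the gradients as numpy arrays.
        numpy_gradients = session.run([grad for grad, var in tensor_gradients], feed)
        # ... computation on numpy_gradients ...
        feed.update(dict(zip(gradient_placeholders, numpy_gradients)))
        session.run(apply_modified, feed)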
Reply to comments
Apply gradients builds the graph?
Yes, if you look in the docs:
Returns:
An `Operation` that applies the specified gradients. If `global_step`
was not None, that operation also increments `global_step`.
Related
What is the new approach (under eager execution) to feeding data through a dataset pipeline in a dynamic fashion, when we need to feed it sample by sample?
I have a tf.data.Dataset which performs some preprocessing steps and reads data from a generator, drawing from a large dataset during training.
Let's say that dataset is represented as:
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])
ds = ds.map(tf.square).shuffle(2).batch(2)
iterator = tf.data.make_one_shot_iterator(ds)
After training I want to produce various visualizations which require that I feed one sample at a time through the network for inference. I've now got this dataset preprocessing pipeline that I need to feed my raw sample through to be sized and shaped appropriately for the network input.
This seems like a use case for the initializable iterator:
placeholder = tf.placeholder(tf.float32, shape=None)
ds = tf.data.Dataset.from_tensor_slices(placeholder)
ds = ds.map(tf.square).shuffle(2).batch(2)
iterator = tf.data.make_initializable_iterator(ds)
# now re-initialize for each sample
Keep in mind that the map operation in this example stands in for a long sequence of preprocessing operations that can't be duplicated for each new data sample being fed in.
This doesn't work with eager execution, though, since you can't use placeholders. The documentation examples all seem to assume a static input, such as in the first example here.
The only way I can think of doing this is with a queue and tf.data.Dataset.from_generator(...), which reads from a queue that I push to before predicting on the data. But this feels hacky, and appears prone to deadlocks that I've yet to solve.
TF 1.14.0
I just realized that the answer to this question is trivial:
Just create a new dataset!
In non-eager (graph) mode, the code below would have degraded in performance, because each dataset operation would have been added to the graph and never released; that is exactly the problem the initializable iterator solves in graph mode.
In eager execution mode, however, TensorFlow operations like this are ephemeral: the datasets and iterators aren't added to a global graph, they just get created and die when no longer referenced. Win one for TF 2.0!
The code below (copy/paste runnable) demonstrates:
import tensorflow as tf
import numpy as np
import time
tf.enable_eager_execution()
inp = np.ones(shape=5000, dtype=np.float32)
t = time.time()
while True:
    ds = tf.data.Dataset.from_tensors(inp).batch(1)
    val = next(iter(ds))
    assert np.all(np.squeeze(val, axis=0) == inp)
    print('Processing time {:.2f}'.format(time.time() - t))
    t = time.time()
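Applied to the original use case, a minimal sketch (assuming the imports and tf.enable_eager_execution() call from the snippet above; the preprocess function and the sample value are illustrative, not code from the question) just wraps the training preprocessing in a function and calls it on a fresh one-element dataset at inference time:
def preprocess(ds):
    # the same preprocessing chain used during training
    return ds.map(tf.square).batch(1)

sample = np.array([3.0], dtype=np.float32)
single_ds = preprocess(tf.data.Dataset.from_tensors(sample))
network_input = next(iter(single_ds))  # feed this to the model for inference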
The motivation for the question came on the heels of this issue in 1.14 where creating multiple dataset operations in graph mode under Keras constitutes a memory leak.
https://github.com/tensorflow/tensorflow/issues/30448
I'm using tf.Session (together with TensorBoard), so I can't enable eager mode.
I need to extract image patches from a large image using tf.image.extract_image_patches. So in my custom training data generator, I add something like:
while True:
    num_patches = tf.image.extract_image_patches(input_big_pic, ksizes, strides, rates, patch_padding)
    with tf.Session() as sess:
        inputs_after_tensor1 = sess.run(num_patches)
        # ... some modifications to the ndarray "inputs_after_tensor1" ...
        yield ({'input1': np.array(result_inputs_after_tensor1)})
My loss is decreasing, but my output images in TensorBoard are not changing, so I wonder: does the tf.Session in my training data generator affect fit_generator?
I tried it, and it is okay to run a tf.Session inside the generator, but it slows down the training procedure.
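One likely reason for the slowdown is that the snippet above rebuilds the extract_image_patches op and opens a new session on every loop iteration. A rough sketch of an alternative, not part of the original answer and with assumed names (big_pics, ksizes, strides, rates, patch_padding) and an assumed channel count, builds the op and the session once and only runs them inside the generator:
import numpy as np
import tensorflow as tf

patch_graph = tf.Graph()
with patch_graph.as_default():
    # placeholder for one large image in NHWC layout (channel count assumed)
    big_pic_ph = tf.placeholder(tf.float32, shape=[1, None, None, 3])
    patches_op = tf.image.extract_image_patches(big_pic_ph, ksizes, strides, rates, patch_padding)
patch_sess = tf.Session(graph=patch_graph)

def training_generator(big_pics):
    while True:
        for big_pic in big_pics:
            patches = patch_sess.run(patches_op, feed_dict={big_pic_ph: big_pic})
            # ... modify the ndarray "patches" here ...
            yield {'input1': np.array(patches)}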
I am trying to run some regression models on a GPU, but I get very low GPU utilization, at most around 20%. After going through the code, I found these lines:
for i in range(epochs):
    rand_index = np.random.choice(args.train_pr, size=args.batch_size)
    rand_x = X_train[rand_index]
    rand_y = Y_train[rand_index]
I use these three lines to select a random batch for each iteration. So I wanted to ask: while training is going on, can I prepare the next batch in advance for the following iteration?
I am working on a regression problem, not a classification problem. I have already looked at threading in TensorFlow, but the examples I found are only for images; there is no example for a big matrix of size 100000x1000 used for training.
You have a large numpy array that lives in host memory. You want to be able to process it in parallel on the CPU and send batches to the device.
Since TF 1.4, the best way to do this is to use tf.data.Dataset, and in particular tf.data.Dataset.from_tensor_slices. However, as the documentation points out, you should probably not pass your numpy arrays directly as arguments to this function, because the arrays would end up being embedded in the graph as constants and copied multiple times. What you should do instead is use placeholders. The example given in the docs is pretty self-explanatory:
features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)
dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
# [Other transformations on `dataset`...]
iterator = dataset.make_initializable_iterator()
sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                          labels_placeholder: labels})
Further preprocessing or data augmentation can be applied to the slices using the .map method. To make sure those operations run concurrently, stick to TensorFlow operations and avoid wrapping Python functions with tf.py_func.
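For instance, a rough sketch of such a pipeline, continuing from the snippet above (the preprocess function, batch size, and the num_parallel_calls/prefetch settings are illustrative assumptions, not part of the quoted docs), could look like:
def preprocess(x, y):
    # TensorFlow-only ops here, so the map can run in parallel
    return tf.cast(x, tf.float32), tf.cast(y, tf.float32)

dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.map(preprocess, num_parallel_calls=4)
dataset = dataset.batch(128)
dataset = dataset.prefetch(1)  # prepare the next batch while the current one is being consumed
iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()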
This is a good use case for generators. You can set up a generator function that yields slices of your numpy matrices one chunk at a time. If you use a package like Keras, you can pass the generator directly to fit_generator, or draw batches from it yourself and feed them to train_on_batch. If you prefer to use TensorFlow directly, you can do something like:
sess = tf.Session()
sess.run(init)
batch_gen = generator(data)
batch = next(batch_gen)  # the generator yields (x, y) tuples
sess.run([optimizer, loss, ...], feed_dict={X: batch[0], y: batch[1]})
Note: the optimizer and loss here are stand-ins; you have to replace them with your own definitions. Also note that your generator should yield an (x, y) tuple. If you are unfamiliar with generators, there are many examples online, but here is a simple example from the Keras documentation that shows how you can read numpy matrices from a file in batches:
def generate_arrays_from_file(path):
    while 1:
        f = open(path)
        for line in f:
            x, y = process_line(line)
            yield (x, y)
        f.close()
But more fundamentally, low GPU usage is not really indicative of a problem loading batches; it may simply mean that your batch size is too small.
Today I ran into a really weird thing.
I load a Caffe model, feed the input, call net.forward, and check the output data: perfect.
Then I feed the labels into the bottom layer's blobs.diff, call net.backward, and check the gradients (params.diff) against the result of the same model in a C++ Caffe program. They were different.
Furthermore, when I ran net.backward several times in Python, I got different gradients each time. This is not the case for the C++ program: the gradients stay the same no matter how many times you run net.backward, as long as you do not change the bottom diff.
I checked the bottom layer's blobs and diff; they remained unchanged in both the Python and C++ programs, and the weights were also unchanged. This is really weird.
Can anyone provide some hints? I can provide code if necessary.
Here is part of the code:
def train_one_step(X, y, lr):
    net.blobs['data'].data[...] = X
    # Forward pass, to get the softmax output
    output = net.forward()
    prob = output['prob']
    # Set the diff of the cross-entropy loss
    net.blobs['prob'].diff[:] = y[:] - prob[:]
    # Compute the gradients of the net parameters
    net.backward()
    # Update weights based on gradients and learning rate
    for key in net.params:
        net.params[key][0].data[:] += lr * net.params[key][0].diff[:]
        net.params[key][1].data[:] += lr * net.params[key][1].diff[:]
    return loss, prob
I just want to write my own step function (the solver's step), so that I can manipulate the loss before it is propagated backwards, among other things. I know this is quite inefficient, since a lot of data gets exchanged between the GPU and CPU.
To test it, I kept feeding in the same sample (same X and y), yet I got different diff data each time. That means this function cannot work.
I'm using TensorFlow to build a deep learning model. (The entire model is very complicated.) In the model, I need to use while_loop to dynamically control the computation flow based on the number of input sentences. Previously, I used a Python for loop instead of while_loop. After I switched to while_loop, the gradients don't work any more.
By the gradients not working I mean that if I execute only the forward pass, it works fine (produces some output). But if I enable gradient computation for training, the program doesn't produce any response when I run it; it just hangs there. In top, the process shows as S (suspended).
Anyone have any idea what is going on?
Below is how I use while_loop, in a very standard way:
def body(argmax_ep_gate, h, mem_state_previous, dummy):
    '''doing some computation'''
    return tf.to_int32(argmax_ep_gate), h, mem_state_current, mem_state_previous

def condition(argmax_ep_gate, h, mem_state_previous, dummy):
    '''return some condition in bool'''

argmax_g, h, _, state = tf.while_loop(
    condition, body, [initial_argmax_g, initial_h, self.state, self.state])
Refer to "TensorFlow stuck into endless loop using tf.while_loop()". Note that if the body contains trainable variables, you need to use a variable scope, as in the sketch below.
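A minimal, self-contained sketch of that pattern (a toy example assuming TF 1.x, not the asker's model): the trainable variable is created once under a named variable scope and then looked up with reuse=True inside the loop body, so building the loop never tries to create it a second time.
import tensorflow as tf

# Create the trainable variable once, outside the loop, under a named scope.
with tf.variable_scope("loop_vars"):
    w = tf.get_variable("w", shape=[], initializer=tf.ones_initializer())

def body(i, acc):
    # Look the variable up again instead of creating a new one.
    with tf.variable_scope("loop_vars", reuse=True):
        w_reused = tf.get_variable("w")
    return i + 1, acc + w_reused

def condition(i, acc):
    return i < 5

i0 = tf.constant(0)
acc0 = tf.constant(0.0)
_, result = tf.while_loop(condition, body, [i0, acc0])
grads = tf.gradients(result, [w])  # gradients flow through the loop

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([result, grads]))  # result: 5.0, gradient w.r.t. w: [5.0]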