tensorflow - memory leak?

I'm running tensorflow 0.10.0rc0 on OSX 10.9.5 Mavericks.
There are approximately 25k training examples, 250 features (x) and 15 classes (y_), and the prediction (y) comes from a single-hidden-layer perceptron.
The following snippet of a simple training loop seems to have a massive memory leak (on the order of tens of GB over roughly 200 iterations, which brings down my MBP :( ):
import tensorflow as tf
# Initialize placeholders and variables etc...
...
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(y, y_))
train_step = tf.train.GradientDescentOptimizer(lrate).minimize(cost)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
for i in range(niter):
    # Train
    _, c = sess.run([train_step, cost])
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    sess.run(correct_prediction)
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print sess.run(accuracy)
    # EDIT: Calculate test error
    ytest = sess.run(y[itrain:itrain+itest, :])
    ytest_ = sess.run(y_[itrain:itrain+itest, :])
    test_prediction = tf.equal(tf.argmax(ytest, 1), tf.argmax(ytest_, 1))
    test_accuracy = tf.reduce_mean(tf.cast(test_prediction, tf.float32))
    print sess.run(test_accuracy)
sess.close()
Am I doing something obviously wrong, or is this by any chance a bug? Thanks!
PS: If this is fixed in a later tensorflow build, note that bazel requires Yosemite or higher, so I can't generate my own .whl file (AFAIK) from source; is a nightly whl available? I would rather not be forced into an OS upgrade right now.

It's unnecessary to run sess.run(correct_prediction): it's a tensor in the TensorFlow graph on which accuracy depends, so it will be evaluated during the call to sess.run(accuracy) anyway.
You're probably growing your graph by creating new correct_prediction and accuracy ops on each iteration, which would explain the memory growth. This is also unnecessary: define them once outside the loop and simply evaluate them each time with calls to sess.run. Your inner loop then becomes something like:
for i in range(niter):
    # Train
    _, c = sess.run([train_step, cost])
    print sess.run(accuracy)
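The test-error lines from the question's edit follow the same pattern: calling tf.argmax, tf.equal and tf.reduce_mean on arrays that sess.run has already returned adds new nodes to the graph on every iteration. One way around this is to score the fetched values outside the graph. The following is a minimal sketch in plain Python, assuming the fetched outputs are rows of class scores and one-hot labels; the argmax and accuracy helpers and the sample values are made up for illustration and are not part of TensorFlow.

```python
def argmax(row):
    """Return the index of the largest value in a sequence."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(predictions, labels):
    """Fraction of rows whose predicted class matches the labelled class."""
    hits = sum(argmax(p) == argmax(t) for p, t in zip(predictions, labels))
    return hits / float(len(predictions))


# Made-up values standing in for the arrays returned by sess.run
ytest = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]
ytest_ = [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]]

test_accuracy = accuracy(ytest, ytest_)
print(test_accuracy)
```

With NumPy arrays the same computation is np.mean(np.argmax(ytest, 1) == np.argmax(ytest_, 1)). Either way no new graph ops are created inside the loop.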

Related

Structuring a Keras project to achieve reproducible results in GPU

I am writing a tensorflow.Keras wrapper to perform ML experiments.
I need my framework to be able to perform an experiment as specified in a YAML configuration file and run it in parallel on a GPU.
I then need a guarantee that if I ran the experiment again I would get, if not the exact same results, something reasonably close.
To try to ensure this, my training script contains these lines at the beginning, following the guidelines in the official documentation:
# Set up random seeds
random.seed(seed)
np.random.seed(seed)
tf.set_random_seed(seed)
This has proven to not be enough.
I ran the same configuration 4 times, and plotted the results:
As you can see, results vary a lot between runs.
How can I set up a training session in Keras to ensure I get reasonably similar results when training on a GPU? Is this even possible?
The full training script can be found here.
Some of my colleagues are using just pure TF, and their results seem far more consistent. What is more, they do not seem to be seeding any randomness except to ensure that the train and validation split is always the same.

Keras + Tensorflow.
Step 1, disable GPU.
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
Step 2, seed those libraries which are included in your code, say "tensorflow, numpy, random".
import tensorflow as tf
import numpy as np
import random as rn
sd = 1 # Here sd means seed.
np.random.seed(sd)
rn.seed(sd)
os.environ['PYTHONHASHSEED']=str(sd)
from keras import backend as K
config = tf.ConfigProto(intra_op_parallelism_threads=1,inter_op_parallelism_threads=1)
tf.set_random_seed(sd)
sess = tf.Session(graph=tf.get_default_graph(), config=config)
K.set_session(sess)
Make sure both pieces of code run at the very start of your program; the results should then be reproducible.
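A quick way to confirm that seeding works is to run the same seeded code twice and compare the outputs. The sketch below does this for Python's own random module only; the sample_run helper is hypothetical, and the same run-twice-and-compare check would have to be repeated for NumPy and TensorFlow. One caveat: PYTHONHASHSEED is read when the interpreter starts, so assigning it through os.environ inside a running script affects only child processes, not the current one.

```python
import random


def sample_run(seed, n=5):
    """Draw n pseudo-random floats after seeding Python's generator."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = sample_run(1)
second = sample_run(1)   # same seed: expected to repeat the first run
other = sample_run(2)    # different seed: expected to differ

print(first == second)
print(first == other)
```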

Try adding seed parameters to the weight/bias initializers. This just adds more specifics to Alexander Ejbekov's comment.
TensorFlow has two kinds of random seed: graph-level and op-level. If you're using more than one graph, you need to specify the seed in every one. You can override the graph-level seed with an op-level one by setting the seed parameter of the function. And two functions, even from different graphs, can be made to output the same value if the same seed is set.
Consider this example:
g1 = tf.Graph()
with g1.as_default():
    tf.set_random_seed(1)
    a = tf.get_variable('a', shape=(1,), initializer=tf.keras.initializers.glorot_normal())
    b = tf.get_variable('b', shape=(1,), initializer=tf.keras.initializers.glorot_normal(seed=2))
with tf.Session(graph=g1) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(a))
    print(sess.run(b))
g2 = tf.Graph()
with g2.as_default():
    a1 = tf.get_variable('a1', shape=(1,), initializer=tf.keras.initializers.glorot_normal(seed=1))
with tf.Session(graph=g2) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(a1))
In this example, output of a is the same as a1, but b is different.

Warning `tried to deallocate nullptr` when using tensorflow eager execution with tf.keras

As per the tensorflow team suggestion, I'm getting used to tensorflow's eager execution with tf.keras. However, whenever I train a model, I receive a warning (EDIT: actually, I receive this warning repeated many times, more than once per training step, flooding my standard output):
E tensorflow/core/common_runtime/bfc_allocator.cc:373] tried to deallocate nullptr
The warning doesn't seem to affect the quality of the training but I wonder what it means and if it is possible to get rid of it.
I use a conda virtual environment with python 3.7 and tensorflow 1.12 running on a CPU. (EDIT: a test with python 3.6 gives the same results.) A minimal script that reproduces the warnings follows. Interestingly, commenting out the line tf.enable_eager_execution() makes the warnings disappear.
import numpy as np
import tensorflow as tf
tf.enable_eager_execution()
N_EPOCHS = 50
N_TRN = 10000
N_VLD = 1000
# the label is positive if the input is a number larger than 0.5
# a little noise is added, just for fun
x_trn = np.random.random(N_TRN)
x_vld = np.random.random(N_VLD)
y_trn = ((x_trn + np.random.random(N_TRN) * 0.02) > 0.5).astype(float)
y_vld = ((x_vld + np.random.random(N_VLD) * 0.02) > 0.5).astype(float)
# a simple logistic regression
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(1, input_dim=1))
model.add(tf.keras.layers.Activation('sigmoid'))
model.compile(
    optimizer=tf.train.AdamOptimizer(),
    # optimizer=tf.keras.optimizers.Adam(), # doesn't work at all with tf eager execution
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Train model on dataset
model.fit(
    x_trn, y_trn,
    epochs=N_EPOCHS,
    validation_data=(x_vld, y_vld),
)
model.summary()

Quick solutions:
The warning did not appear when I ran the same script with TF 1.11, and training still reached the same final validation accuracy on the synthetic dataset.
OR
Suppress the errors/warnings using the native os module (adapted from https://stackoverflow.com/a/38645250/2374160), i.e. by setting the TensorFlow logging environment variable so that no error messages are shown.
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
More info:
Solving this error properly may require familiarity with MKL library calls and how they interface with TensorFlow's core, which is written in C++ (this is beyond my current TF expertise).
In my case, this memory deallocation error occurred whenever the apply_gradients() method of an optimizer was called. In your script, it is called when the model is being fitted to the training data.
This error is raised from here: tensorflow/core/common_runtime/mkl_cpu_allocator.h
I hope this helps as a temporary solution for convenience.
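For reference, the message in the question is logged at error level (the leading "E"), which is why the value '3' is needed to hide it. Below is a small sketch of a helper that documents the four values of TF_CPP_MIN_LOG_LEVEL and sets the variable. The helper name set_tf_log_level and the LOG_LEVELS table are my own, not part of TensorFlow, and the variable only takes effect if it is set before tensorflow is imported.

```python
import os

# Meaning of each TF_CPP_MIN_LOG_LEVEL value
LOG_LEVELS = {
    0: "show all messages",
    1: "hide INFO",
    2: "hide INFO and WARNING",
    3: "hide INFO, WARNING and ERROR",
}


def set_tf_log_level(level):
    """Set the C++ log threshold; call this before importing tensorflow."""
    if level not in LOG_LEVELS:
        raise ValueError("level must be one of %s" % sorted(LOG_LEVELS))
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = str(level)
    return LOG_LEVELS[level]


description = set_tf_log_level(3)
print(description)
# "import tensorflow as tf" would go here, after the variable is set
```

Note that level 3 also hides genuine errors, so it is best treated as a temporary measure.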

OutOfRangeError: tensorflow iterator not reinitializing between runs

I am fine-tuning an Inception model via tensorflow with the setup below, and am feeding batches through the tf.data Dataset API. However, every time I attempt to train this model (before successfully retrieving any batches), I get an OutOfRangeError claiming that the iterator is exhausted:
Caught OutOfRangeError. Stopping Training. End of sequence
[[node IteratorGetNext (defined at <ipython-input-8-c768436e70d8>:13) = IteratorGetNext[output_shapes=[[?,224,224,3], [?,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]
with tf.Graph().as_default():
I created a function to feed in hard-coded batches as the result of get_batch, and this runs and converges without any issues, leading me to believe that the graph and session code is working properly. I also tested the get_batch function by iterating in a session, and this causes no errors. The behavior I would expect is that restarting training (especially after resetting the notebook, etc.) would produce a fresh iterator over the dataset.
Code to train model:
with tf.Graph().as_default():
    tf.logging.set_verbosity(tf.logging.INFO)
    images, labels = get_batch(filenames=tf_train_record_path+train_file)
    # Create the model, use the default arg scope to configure the batch norm parameters.
    with slim.arg_scope(inception.inception_v1_arg_scope()):
        logits, ax = inception.inception_v1(images, num_classes=1, is_training=True)
    # Specify the loss function:
    tf.losses.mean_squared_error(labels, logits)
    total_loss = tf.losses.get_total_loss()
    tf.summary.scalar('losses/Total_Loss', total_loss)
    # Specify the optimizer and create the train op:
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
    train_op = slim.learning.create_train_op(total_loss, optimizer)
    # Run the training:
    final_loss = slim.learning.train(
        train_op,
        logdir=train_dir,
        init_fn=get_init_fn(),
        number_of_steps=1)
Code to get batches using Dataset:
def get_batch(filenames):
    dataset = tf.data.TFRecordDataset(filenames=filenames)
    dataset = dataset.map(parse)
    dataset = dataset.batch(2)
    iterator = dataset.make_one_shot_iterator()
    data_X, data_y = iterator.get_next()
    return data_X, data_y
This previously asked question resembles the issue I am experiencing; however, I am not using a batch_join call. I am not sure if this is an issue with slim.learning.train, restoring from a checkpoint, or scope. Any help would be appreciated!

Your input pipeline looks OK. The problem might be a damaged TFRecords file. You can try your code with random data, or feed your images as numpy arrays with tf.data.Dataset.from_tensor_slices().
Your parse function may also be causing problems. Try printing your image/label with sess.run.
I'd also advise using the Estimator API for training. It is much more convenient, and slim will be deprecated soon.
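One way to check the "damaged file" hypothesis without TensorFlow is to walk the record framing directly: an empty or truncated file would explain an iterator that reports end of sequence before yielding a batch. The sketch below assumes the standard TFRecord layout (8-byte little-endian length, 4-byte CRC of the length, payload, 4-byte CRC of the payload) and does not verify the CRCs. The functions count_tfrecords and frame are hypothetical helpers, and the demo data is synthetic.

```python
import io
import struct


def count_tfrecords(stream):
    """Count records by walking the TFRecord framing (CRCs are not checked)."""
    count = 0
    while True:
        header = stream.read(12)  # uint64 length + uint32 CRC of the length
        if not header:
            return count  # clean end of file
        if len(header) < 12:
            raise ValueError("truncated header after %d records" % count)
        (length,) = struct.unpack("<Q", header[:8])
        body = stream.read(length + 4)  # payload + uint32 CRC of the payload
        if len(body) < length + 4:
            raise ValueError("truncated record after %d records" % count)
        count += 1


def frame(data):
    """Demo-only framing with zeroed CRC fields (not a valid TFRecord writer)."""
    return struct.pack("<Q", len(data)) + b"\x00" * 4 + data + b"\x00" * 4


demo = io.BytesIO(frame(b"abc") + frame(b"defgh"))
record_count = count_tfrecords(demo)
print(record_count)
```

For a real file, pass open(path, "rb") instead of the in-memory stream. A count of zero or a ValueError points at the file rather than at the training code.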

How to reset the AdamOptimizer from Tensorflow while training

We are currently working on a project in which we change a cGAN architecture in TensorFlow to see if we get better results than standard cGANs. Because we implement a progressively growing architecture, we would like to reset TensorFlow's AdamOptimizer after each phase transition. So far we have not managed to do so. We tried multiple approaches, but either we get the error message "Graph is finalized and cannot be modified" or the parameters do not get reset.
We would be very thankful if somebody could give a hint or a general approach.

You just have to define the optimizer, gather the Adam variables and their initializers. Then, during the training, you can re-initialize the variables by running the initializers.
The following minimal example should point you in the right direction
import tensorflow as tf
x = tf.placeholder(tf.float32, shape=(None, 1))
y_hat = tf.layers.Dense(10)(x)
y = 10
loss = tf.reduce_mean(tf.squared_difference(y_hat, y))
train = tf.train.AdamOptimizer().minimize(loss)
# tf.global_variables() replaces the deprecated tf.all_variables()
print(tf.global_variables())
# Collect the optimizer's variables by name
adam_vars = [var for var in tf.global_variables() if "adam" in var.name.lower()]
print(adam_vars)
# Build the reset ops before the session starts, so the graph is not modified later
adam_reset = [var.initializer for var in adam_vars]
with tf.Session() as sess:
    # do stuff with your model: train, evaluate, whatever
    # when the reset condition is met, run:
    sess.run(adam_reset)
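To make concrete what "resetting" means, here is a toy scalar version of Adam in plain Python, following the textbook update rule rather than TensorFlow's implementation; the ScalarAdam class is hypothetical. The state consists of the two moment estimates and the step counter, and a reset returns all three to their initial values. As far as I know, TF 1.x keeps the counter as beta1_power and beta2_power variables whose names may not contain "adam", so it may be worth checking that a name filter catches them; I believe optimizer.variables() lists the full optimizer state.

```python
import math


class ScalarAdam:
    """Toy Adam for a single scalar parameter (textbook formulation)."""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.reset()

    def reset(self):
        """Forget the accumulated moments and restart the bias correction."""
        self.m = 0.0  # first-moment estimate
        self.v = 0.0  # second-moment estimate
        self.t = 0    # number of steps taken

    def step(self, param, grad):
        """Return the updated parameter after one Adam step."""
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


# Minimise w**2 for a few steps, then reset at a "phase transition"
opt = ScalarAdam(lr=0.1)
w = 1.0
for _ in range(3):
    w = opt.step(w, grad=2.0 * w)
opt.reset()

after_reset = opt.step(1.0, grad=2.0)
fresh = ScalarAdam(lr=0.1).step(1.0, grad=2.0)
print(after_reset == fresh)
```

After the reset, the optimizer behaves like a newly constructed one, which is the property the initializer-based approach above aims for.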

Issue with TensorFlow saving

I am training neural nets with TensorFlow, and the model's training is working using a custom implementation of batch gradient descent. I have a logging function which records validation error, and it gets down to about 2.6%. I'm saving the model every 10 epochs using a tf.train.Saver.
However, when I load the variables into memory again using a tf.train.Saver with the same script, the model performs poorly, about as well as it does when the weights are randomly initialized. I have inspected the convolutional filters in the checkpoint, however, and they don't seem to be random.
I have not included all of my code, since it's around 400 lines long, but I've included what seem to be the important sections here and summarized the other functionality.
class ModelTrainer:
    def __init__(self, ...hyperparameters...):
        # Initialize datasets and hyperparameters
        for each gpu
            # Create loss function and gradient assigned to this gpu using tf.device("/gpu:n")
        with tf.device("/cpu:0")
            # Average and clip gradients from the GPUs
            # Create this batch gradient descent operation for each trainable variable
            variable.assign_sub(learning_rate * averaged_and_clipped_gradient).op
    def train(self, ...hyperparameters...)
        saver = train.Saver(tf.all_variables(), max_to_keep = 30)
        init = tf.initialize_all_variables()
        sess = tf.Session()
        if starting_point is not None: # Used to evaluate existing models
            saver.restore(sess, starting_point)
        else:
            sess.run(init)
        for i in range(number_of_batches)
            # ... Get training batch ...
            gradients = sess.run(calculate_gradients, feeds = training_batch)
            # Average "gradients" variable across multiple batches
            # Must be done because of GPU memory limitations
            if i % meta_batch_size == 0:
                sess.run(apply_gradients_operators,
                         feeds = gradients_that_have_been_averaged_across_multiple_batches)
            # Log validation error
            if i % save_after_n_batches == 0:
                saver.save(sess, "some-filename", global_step=self.iter_num)
As expected, running these two functions creates a set of checkpoint files called "some-filename-40001", or whatever other iteration number the training is at when the file is saved. Unfortunately, when I load these checkpoints back in using the starting_point parameter, they perform on par with random initialization.
Initially I assumed it was something to do with the way I'm training the model, since I haven't found anyone else with this issue, but the validation error behaves as expected.
Edit: More odd results. After more experimentation, I have found that when I load the saved model using the code:
with tf.Session() as sess:
    saver = tf.train.import_meta_graph("saved-checkpoint-40.meta")
    saver.restore(sess, "saved-checkpoint-40")
    # ... Use model in some way ...
I get different, but still incorrect results.
