TensorFlow cannot use apply_gradients with AdamOptimizer - python

I want to use the AdamOptimizer, but I also want to edit my gradients every step.
The typical usage is as follows:
train_step = tf.train.AdamOptimizer(learning_rate).minimize(loss)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
sess.run(train_step, feed_dict=feed_dict)
This applies a single training step with the AdamOptimizer.
I want to modify the gradients every step, so I extract them and them reinsert them with the following code:
opt = tf.train.AdamOptimizer(learning_rate=1e-3)
grads_and_vars = opt.compute_gradients(loss)
train_opt = opt.apply_gradients(grads_and_vars)
sess.run(train_opt, feed_dict=feed_dict)
I would normally apply some operations to grads_and_vars, but I'm just trying to get this to work first. The previous code fails at sess.run(train_opt, feed_dict=feed_dict) because of the following error:
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value beta1_power_1
[[Node: beta1_power_1/read = Identity[T=DT_FLOAT, _class=["loc:#Variable"], _device="/job:localhost/replica:0/task:0/cpu:0"](beta1_power_1)]]
which is caused by train_opt = opt.apply_gradients(grads_and_vars), but am I not applying the gradients correctly?
There is no error with the GradientDescentOptimizer, so I know this must be the right way to extract the gradients and then reinsert them for a training step.
Is there something I'm missing? How can I use the AdamOptimizer this way?
EDIT: I mentioned that the second code block works with GradientDescentOptimizer, but it is about 10 times slower than the first code. Is there a way to speed that up?

run this sess.run(tf.local_variables_initializer()), there are local variables in adam, you need to initialize them

Related

OutOfRangeError: tensorflow iterator not reinitializing between runs

I am fine-tuning an Inception model via tensorflow with the below setup, and am feeding batches tf.DatasetAPI. However, every time I attempt to train this model (before successfully retrieving any batches), I get an OutOfRangeError claiming that the iterator is exhausted:
Caught OutOfRangeError. Stopping Training. End of sequence
[[node IteratorGetNext (defined at <ipython-input-8-c768436e70d8>:13) = IteratorGetNext[output_shapes=[[?,224,224,3], [?,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]
with tf.Graph().as_default():
I created a function to feed in hard coded batches as the result of get_batch, and this runs and converges without any issues, leading me to believe that the graph and session code is working properly. I also tested the get_batch function to iterate in a session, and this causes no errors. The behavior I would expect is that restarting training (especially with reseting the notebook, etc. ) would produce a fresh iterator over the dataset.
Code to train model:
with tf.Graph().as_default():
tf.logging.set_verbosity(tf.logging.INFO)
images, labels = get_batch(filenames=tf_train_record_path+train_file)
# Create the model, use the default arg scope to configure the batch norm parameters.
with slim.arg_scope(inception.inception_v1_arg_scope()):
logits, ax = inception.inception_v1(images, num_classes=1, is_training=True)
# Specify the loss function:
tf.losses.mean_squared_error(labels,logits)
total_loss = tf.losses.get_total_loss()
tf.summary.scalar('losses/Total_Loss', total_loss)
# Specify the optimizer and create the train op:
optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
train_op = slim.learning.create_train_op(total_loss, optimizer)
# Run the training:
final_loss = slim.learning.train(
train_op,
logdir=train_dir,
init_fn=get_init_fn(),
number_of_steps=1)
Code to get batches using Dataset
def get_batch(filenames):
dataset = tf.data.TFRecordDataset(filenames=filenames)
dataset = dataset.map(parse)
dataset = dataset.batch(2)
iterator = dataset.make_one_shot_iterator()
data_X, data_y = iterator.get_next()
return data_X, data_y
This previously asked question resembles the issue I am experiencing, however, I am not using a batch_join call. I am not if this is an issue with slim.learning.train, restoring from a checkpoint, or scope. Any help would be appreciated!
Your input pipeline looks ok. The problem might be with damaged TFRecords file. You can try your code with random data, or use your images as numpy arrays with tf.data.Dataset.from_tensor_slices().
Also your parse function may cause problems. Try to print your image/label with sess.run.
And I'd advise using Estimator API as train_op. It is much more convenient and slim will be deprecated soon.

How to reset the AdamOptimizer from Tensorflow while training

We are currently working on a project in which we change a cGAN architecture on Tensorflow to see if we get better results than standard cGANs. Due to the fact that we implement a progressivly growing architecture we would like to reset the AdamOptimizer from Tensorflow after each phase transition. Nonetheless we still did not manage to do so. We tried multiple approaches but either we get the error message "Graph is finalized and cannot be modified" or the parameters do not get reset.
Would be very thankful if somebody could give a hint or a general approach.
You just have to define the optimizer, gather the Adam variables and their initializers. Then, during the training, you can re-initialize the variables by running the initializers.
The following minimal example should point you in the right direction
import tensorflow as tf
x = tf.placeholder(tf.float32, shape=(None, 1))
y_hat = tf.layers.Dense(10)(x)
y = 10
loss = tf.reduce_mean(tf.squared_difference(y_hat, y))
train = tf.train.AdamOptimizer().minimize(loss)
print(tf.all_variables())
adam_vars = [var for var in tf.all_variables() if "adam" in var.name.lower()]
print(adam_vars)
adam_reset = [var.initializer for var in adam_vars]
with tf.Session() as sess:
# do stuff with your model: train, evaluate, whatever
# when the reset condition is met, run:
sess.run(adam_reset)

Tensorflow graph results appear random after restor

I trained a model to predict the next word in a sequence. I saved the model using tf.train.Saver(). However, when I go to restore the model and supply it the same seed value, the output changes each time I run the testing. For example, if I supply it with the words "happy birthday to", it will predict "you", but then , if I run it 10 seconds later, it will predict "rhyno". I have a feeling that this might be due to me randomly initializing the internal layers to random normal weights, however, wouldn't restoring the model restore the values after training and not reinitialize the layers? My restore code is below:
with tf.Session() as sess:
saved_model = tf.train.import_meta_graph(
'C:/Users/me/my_model.meta') # load graph from training
saved_model.restore(sess, tf.train.latest_checkpoint('./'))
imported_graph = tf.get_default_graph()
x = imported_graph.get_operation_by_name("ph_x").outputs[0]
prediction = imported_graph.get_tensor_by_name('prediction:0')
run_input = seed_values
print(np.array2string(run_input, separator=" "))
for _ in range(production_size):
run_input_oh = hlp.word_to_one_hot(run_input, hp_dict, 0)
pred = hlp.one_hot_to_word(sess.run(prediction, feed_dict={x: run_input_oh}), rev_dict)
print(sess.run(prediction, feed_dict={x: run_input_oh}))
You called default graph after restoring the saved weights. This will ignore restored weight.
Solution:
First call default graph, then restore the weights!
Try this!
with tf.Session() as sess:
saved_model = tf.train.import_meta_graph(
'C:/Users/me/my_model.meta') # load graph from training
imported_graph = tf.get_default_graph()
saved_model.restore(sess, tf.train.latest_checkpoint('./'))
...
#midhun pk's answer is mistaken, calling tf.get_default_graph() does not modify the graph, and calling it before or after saved_model.restore makes no difference.
Your code seems fine (calling import_meta_graph adds the nodes of the saved graph to the current graph, and calling restore restores the states of the variables), and it's difficult to debug without more information about your model. (eg what are run_input, seed_values, etc?) Can you provide a minimal reproducible example?
You should be able to verify if you variables are correctly restored by printing the value of the variable at save and restore time. Before saving, you can do print(sess.run(variable)) (or use tf.Print). After restoring, you can check the weights of the restored variables as follows: Supposing your variable's name is "XX", do:
var_value = imported_graph.get_tensor_by_name("XX:0")
print(sess.run(var_value))
I was able to find the issue, and it did not deal with the process of calling the saved weights.
When I first built the model for training, I created a dictionary from a text file by creating a set. In testing, I built the dictionary from the same text file, assuming that the order of elements would remain the same. Do not make this assumption. The order can change, hence the seemingly random results.

Tensorflow compute_gradients and apply_gradients running out of memory

I have the following lines as part of a program:
tensor_gradients = optimizer.compute_gradients(cross_entropy)
with tf.Session() as session:
for step in range(20000):
batch = mnist.train.next_batch(train_batch_size)
feed = {input_x: batch[0], input_y: batch[1]}
gradients = session.run([tensor_gradients], feed)[0]
for i in range(len(gradients)):
gradients[i] = (gradients[i][0], tensor_gradients[i][1])
... computation on gradients ...
training_step = optimizer.apply_gradients(gradients)
training = session.run([training_step], feed)
The reason I'm doing this is because I want to modify the gradients using numpy. The above code runs out of memory around step 800. However, if you replace the optimizer.apply_gradients step by tensor_gradients, then the code does not run out of memory.
training_step = optimizer.apply_gradients(tensor_gradients)
Any ideas at what might be happening? The rest of the code remains the same except for the line above. Is it possible that the numpy arrays in gradients is not being garbage collected because they are being passed into the apply_gradients step? I have no idea where the memory leak could be or if I'm inadvertently adding to the tensorflow graph by passing modified gradients (in numpy array form) back into apply_gradients.
Any ideas at what might be happening?
OOM happens because you're constructing the graph inside the loop: This builds a graph with 20,000x nodes, and running it may need more memory than you have.
Move all TF operations that build the graph outside the loop, i.e. everything except feed_dict construction and sess.run calls.
Reply to comments
Apply gradients builds the graph?
Yes, if you look in the docs:
Returns:
An `Operation` that applies the specified gradients. If `global_step`
was not None, that operation also increments `global_step`.

Issue with TensorFlow saving

I am training neural nets with TensorFlow, and the model's training is working using a custom implementation of batch gradient descent. I have a logging function which records validation error, and it gets down to about 2.6%. I'm saving the model every 10 epochs using a tf.train.Saver.
However, when I load the variables into memory again using a tf.train.Saver with the same script, the model performs poorly--with about the performance it does when the weights are randomly initialized. I have inspected the constitutional filters in the checkpoint and they don't seem to be random however.
I have not included all of my code, since its around 400 lines long, but I've included what seem to be important sections here and summarized the other functionality.
class ModelTrainer:
def __init__(self, ...hyperparameters...):
# Intitialize datasets and hyperparameters
for each gpu
# Create loss function and gradient assigned to this gpu using tf.device("/gpu:n")
with tf.device("/cpu:0")
# Average and clip gradients from the gpu's
# Create this batch gradient descent operation for each trainable variable
variable.assign_sub(learning_rate * averaged_and_clipped_gradient).op
def train(self, ...hyperparameters...)
saver = train.Saver(tf.all_variables(), max_to_keep = 30)
init = tf.initialize_all_variables()
sess = tf.Session()
if starting_point is not None: # Used to evaluate existing models
saver.restore(sess, starting_point)
else:
sess.run(init)
for i in range(number_of_batches)
# ... Get training batch ...
gradients = sess.run(calculate_gradients, feeds = training_batch)
# Average "gradients" variable across multiple batches
# Must be done because of GPU memory limitations
if i % meta_batch_size == 0:
sess.run(apply_gradients_operators,
feeds = gradients_that_have_been_averaged_across_multiple_batches)
# Log validation error
if i % save_after_n_batches == 0:
saver.save(sess, "some-filename", global_step=self.iter_num)
As expected, running these two functions creates a set of checkpoint files called "some-filename-40001" or whatever other iteration number the training is at when that file is saved. Unfortunately when I load these checkpoints back in using the start_point parameter they perform on par with random initialization.
Initially I assumed it was something to do with the way I'm training the model, since I haven't found anyone else with this issue, but the validation error behaves as expected.
Edit: More odd results. After more experimentation, I have found that when I load the saved model using the code:
with tf.Session() as sess:
saver = tf.train.import_meta_graph("saved-checkpoint-40.meta")
saver.restore(sess, "saved-checkpoint-40")
# ... Use model in some way ...
I get different, but still incorrect results.

Categories