I'm training a model where the input vector is the output of another model. This involves restoring the first model from a checkpoint file while initializing the second model from scratch (using tf.initialize_variables()) in the same process.
There is a substantial amount of code and abstraction, so I'm just pasting the relevant sections here.
The following is the restoring code:
self.variables = [var for var in all_vars if var.name.startswith(self.name)]
self.saver = tf.train.Saver(self.variables, max_to_keep=3)
self.save_path = tf.train.latest_checkpoint(os.path.dirname(self.checkpoint_path))

if should_restore:
    self.saver.restore(self.sess, self.save_path)
else:
    self.sess.run(tf.initialize_variables(self.variables))
Each model is scoped within its own graph and session, like this:
self.graph = tf.Graph()
self.sess = tf.Session(graph=self.graph)
with self.sess.graph.as_default():
    # Create variables and ops.
All the variables within each model are created within the variable_scope context manager.
The feeding works as follows:
A background thread calls the first model's sess.run(inference_op) on an image loaded with scipy.misc.imread(X) and puts the result in a blocking, thread-safe queue.
The main training loop reads from the queue and calls sess.run(train_op) on the second model.
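For context, a minimal sketch of that producer/consumer setup; the names inference_sess, inference_op, inference_input, train_sess, train_op, train_input and image_paths are placeholders for my actual objects:

import queue
import threading

import scipy.misc

feed_queue = queue.Queue(maxsize=8)  # blocking, thread-safe queue

def producer(paths):
    # Background thread: run the first model's inference and enqueue the result.
    for path in paths:
        image = scipy.misc.imread(path)
        features = inference_sess.run(inference_op, feed_dict={inference_input: image})
        feed_queue.put(features)

threading.Thread(target=producer, args=(image_paths,), daemon=True).start()

while True:
    features = feed_queue.get()
    # Main thread: train the second model on the first model's output.
    train_sess.run(train_op, feed_dict={train_input: features})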
PROBLEM:
I am observing that the loss values, even in the very first iteration of training (second model), keep changing drastically across runs (and become nan within a few iterations). I confirmed that the output of the first model is exactly the same every time. Commenting out the sess.run of the first model and replacing it with identical input from a pickled file does not show this behaviour.
This is the train_op:
loss_op = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=network.feedforward())  # labels: ground-truth class indices

# Apply gradients.
with tf.control_dependencies([loss_op]):
    opt = tf.train.GradientDescentOptimizer(lr)
    grads = opt.compute_gradients(loss_op)
    apply_gradient_op = opt.apply_gradients(grads)
return apply_gradient_op
I know this is vague, but I'm happy to provide more details. Any help is appreciated!
The issue was most certainly caused by concurrent execution of different session objects. I moved the first model's session from the background thread to the main thread, repeated the controlled experiment several times (running for over 24 hours and reaching convergence), and never observed NaN. On the other hand, concurrent execution causes the model to diverge within a few minutes.
I've restructured my code to use a common session object for all models.
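A rough sketch of that restructuring, with both models built in one graph under their own variable scopes and driven by a single session; build_first_model, build_second_model and checkpoint_dir are placeholders for my actual code:

import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    with tf.variable_scope("model_a"):
        inference_op = build_first_model()    # placeholder: the pretrained model
    with tf.variable_scope("model_b"):
        train_op = build_second_model()       # placeholder: the model being trained

    # Saver restricted to the first model's variables, so only they are restored.
    saver_a = tf.train.Saver(
        [v for v in tf.global_variables() if v.name.startswith("model_a")])

sess = tf.Session(graph=graph)
with graph.as_default():
    sess.run(tf.global_variables_initializer())
    saver_a.restore(sess, tf.train.latest_checkpoint(checkpoint_dir))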
Related
Is loading a model and predicting with it on different threads supported and correct in TensorFlow?
Some background:
When trying to load a model in thread A and then predicting with it in thread B, we get the following error message:
ValueError: Tensor SOME_TENSOR is not an element of this graph.
I've found this TF GitHub thread, proposing to solve it by storing the graph when loading the model and using it as default when predicting. Sort of like this:
# thread A
global graph
graph = tf.get_default_graph()
...
# thread B
with graph.as_default():
    preds = model.predict(image)
I've tried doing that, yet I also got errors due to sessions being different and variables uninitialised:
tensorflow.python.framework.errors_impl.FailedPreconditionError:
Failed precondition: Error while reading resource variable lstm_2_3/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/lstm_2_3/bias)
I started by fixing it with keras.backend.get_session().run(tf.compat.v1.global_variables_initializer()), yet that didn't quite work, producing wrong predictions. Instead I decided to treat the session the same way as the graph, and pass it over from where I load the model.
So the solution I have is as follows:
# thread A
global graph
global sess
graph = tf.get_default_graph()
sess = K.get_session()
...
# thread B
with graph.as_default():
    try:
        preds = model.predict(image)
    except FailedPreconditionError:
        K.set_session(sess)
        preds = model.predict(image)
Not gonna lie, it feels hacky. Is this the right way to handle model loading / prediction on separate threads? Is there anything wrong with this approach?
It seems that Keras/TensorFlow simply aren't thread-safe; however, it is possible to make it work in this case. This is hardly the correct way to fix the issue, but what helped me is changing the way I load the model.
def load_threadsafe():
    model = load()  # your usual model loading
    model._make_predict_function()
    return model
Note the call to the protected _make_predict_function method, which is still hacky. The solution was shared by @fgerard on this issue. As far as I can tell, _make_predict_function is called internally on the first prediction, and calling it from a non-main thread causes issues. Hence the fix is to call it explicitly on the thread that loads the model, before any predictions are made.
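For completeness, a rough sketch of how I use it; the image variable and the thread wiring here are illustrative, not from any library:

import threading

import tensorflow as tf
from keras import backend as K

model = load_threadsafe()        # load and warm up on the main (loader) thread
graph = tf.get_default_graph()
sess = K.get_session()

results = []

def worker(image):
    # The background thread reuses the loader thread's graph and session.
    with graph.as_default():
        K.set_session(sess)
        results.append(model.predict(image))

t = threading.Thread(target=worker, args=(image,))
t.start()
t.join()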
I have the following problem. I'm using a TensorFlow Keras model to evaluate continuous sensor data. The input to my model consists of 15 sensor data frames. Because model.predict() takes nearly 1 second, I wanted to execute this function asynchronously so that I can collect the next data frames during that time.
To accomplish this I created a Pool with the multiprocessing library and a function for model.predict. My code looks something like this:
def predictData(data):
    return model.predict(data)
global model
model = tf.keras.models.load_model("Network.h5")
model._make_predict_function()
p = Pool(processes = 4)
...
res = p.apply_async(predictData, ([[iinput]],))
print(res.get(timeout = 10))
Now I always get a timeout error when calling predictData(). It seems like model.predict() is not working correctly. What am I doing wrong?
The reason is that each process you spawn will require a new, initialized copy of your model to make predictions with. Therefore you have to make sure you instantiate/load your model for every spawned process. This is definitely not optimal.
This is a known caveat with multiprocessing in machine learning training and/or inference. Some libraries come with multiprocessing features out of the box and provide parallelizable calls to their models. However, in most libraries, once you want to do multiprocessing, you are on your own!
Make sure you instantiate your model once and then find a way to share that model across processes. One basic way to do that is to serve your model as a Flask service, then make predictions against that service to your heart's content. Cheers!
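A minimal sketch of that idea, assuming a hypothetical Flask app wrapping the already-loaded model (the endpoint name and JSON payload format are made up):

# serve.py -- run once; the model is loaded a single time in this process.
from flask import Flask, request, jsonify
import numpy as np
import tensorflow as tf

app = Flask(__name__)
model = tf.keras.models.load_model("Network.h5")

@app.route("/predict", methods=["POST"])
def predict():
    data = np.array(request.get_json()["data"])
    preds = model.predict(data)
    return jsonify(preds.tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Other processes then POST their sensor frames to /predict instead of each loading the model.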
It is possible to run multiple predictions in multiple concurrent Python processes; you just have to build, inside each independent process, its own TensorFlow computational graph and then call keras.model.predict.
Write a function that you will use with the multiprocessing module (with the Process or Pool class). Within this function you should build your model, your TensorFlow graph and whatever else you need; set all TensorFlow and Keras variables; then call the predict method on it; and then pipe the result back to your master process.
For example:
from multiprocessing import Pool

def f(data):
    import tensorflow, keras
    # Configure your TensorFlow and Keras settings here (e.g. GPU/CPU usage).
    keras_model = build_your_keras_model()
    result = keras_model.predict(data)
    return result

if __name__ == '__main__':
    p = Pool(processes=4)
    res = p.apply_async(f, (data,))
    print(res.get(timeout=10))
I have trained a neural network with TensorFlow. After training I saved it and loaded it again in a new .py file to avoid retraining by accident. As I was testing it with some extra data I found out that it predicts different things for the same data. Should it not theoretically compute the same thing for the same data?
Some information
feed forward net
4 hidden layers with 900 neurons each
5000 training epochs
reached accuracy of ~80%
data was normalized using normalize from sklearn.preprocessing
cost function: tensorflow.nn.softmax_cross_entropy_with_logits
optimizer: tf.train.AdamOptimizer
I am giving my network the data as a matrix, the same way I used for training (each row containing a data sample, with as many columns as there are input neurons).
Out of ten prediction cycles with the same data, my network produces different results in at least 2 cycles (the maximum observed so far is 4).
How can this be? In theory, all that happens are data-processing calculations of the form W_i*x_i + b_i. As my x_i, W_i and b_i do not change anymore, how come the prediction varies? Might there be a mistake in my model reloading routine?
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('path to .meta')
    saver.restore(sess, tf.train.latest_checkpoint('path to checkpoints'))
    result = sess.run(tf.argmax(prediction.eval(feed_dict={x: input_data}), 1))
    print(result)
So this was a really stupid mistake on my part. Now loading the model from a save works fine. The problem was caused by the global variables initializer. If you leave it out, it will work fine. The previously found information may prove useful for someone, so I will leave it here. The solution is now:
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, 'path to your saved file C:x/y/z/model/model.ckpt')
After this you can go on as usual. I do not really know why the variables initializer prevents this from working. As I see it, it should be something like: initialize all variables so they exist (with random values), then go to that saved file and use the values from there, but apparently something else happens...
So I have been doing some testing and found out the following about this problem.
As I was trying to reuse my created model, I had to use tf.global_variables_initializer(). By doing so it overwrote my imported graph and all the values were random, which explains the different network outputs. This still left me with a problem to solve: how do I load my network? The workaround I am currently using is far from optimal, but it at least allows me to use my saved model. TensorFlow allows one to give unique names to the functions and tensors used. By doing so I could access them through the graph:
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('path to .meta')
    saver.restore(sess, tf.train.latest_checkpoint('path to checkpoints'))
    graph = tf.get_default_graph()
    graph.get_tensor_by_name('name:0')
Using this method I could access all my saved values, but they were separated! It means that I had one weight and one bias per operation used, which led to a bunch of new variables. If you do not know the names, use the following:
print(graph.get_all_collection_keys())
This prints the collection names (our variables are stored in collections)
print(graph.get_collection('name'))
This allows us to access the collection and see what the names/keys for our variables are.
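For example, a sketch of pulling one of those saved values back out; the tensor name 'hidden1/weights:0' is made up and would be replaced by whatever the printed keys reveal:

with tf.Session() as sess:
    saver = tf.train.import_meta_graph('path to .meta')
    saver.restore(sess, tf.train.latest_checkpoint('path to checkpoints'))
    graph = tf.get_default_graph()
    print(graph.get_all_collection_keys())                   # e.g. ['variables', 'trainable_variables']
    weights = graph.get_tensor_by_name('hidden1/weights:0')  # assumed tensor name
    print(sess.run(weights))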
This led to another problem: I could no longer use my model, as the global variables initializer had overwritten everything. I thus had to redefine the whole model manually, with the weights and biases that I retrieved previously.
Unfortunately, this is the only thing I could come up with. If anyone has a better idea, please let me know.
The whole thing, with the mistake, looked like this:
# imports...
# placeholders for data...

def my_network(data):
    # network definition with tf functions
    return output

def train_my_net():
    prediction = my_network(data)
    # cost function
    # optimizer
    with tf.Session() as sess:
        # for as many epochs as I want:
        #     training routine
        # save
        ...

def use_my_net():
    prediction = my_network(data)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())  # <-- the mistake
        saver = tf.train.import_meta_graph('path to .meta')
        saver.restore(sess, tf.train.latest_checkpoint('path to checkpoints'))
        print(sess.run(prediction.eval(feed_dict={placeholder: data})))
        graph = tf.get_default_graph()
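For contrast, a sketch of the corrected loading routine without the initializer; the tensor names 'prediction:0' and 'input:0' are stand-ins for whatever names the saved graph actually uses:

def use_my_net_fixed():
    with tf.Session() as sess:
        # Restore the graph and the trained values; no global_variables_initializer().
        saver = tf.train.import_meta_graph('path to .meta')
        saver.restore(sess, tf.train.latest_checkpoint('path to checkpoints'))
        graph = tf.get_default_graph()
        prediction = graph.get_tensor_by_name('prediction:0')  # assumed tensor name
        placeholder = graph.get_tensor_by_name('input:0')      # assumed placeholder name
        print(sess.run(prediction, feed_dict={placeholder: data}))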
I've read the Distributed TensorFlow doc and this question on StackOverflow, but I still have some doubts about the dynamics behind the distributed training that can be done with TensorFlow and its parameter server architecture.
This is a snippet of code from the Distributed TensorFlow doc:
if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":
    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        # Build model...
        loss = ...
        global_step = tf.contrib.framework.get_or_create_global_step()
        train_op = tf.train.AdagradOptimizer(0.01).minimize(
            loss, global_step=global_step)
And here is part of the answer to the StackOverflow question that I read:
1. The worker reads all of the shared model parameters in parallel from the PS task(s), and copies them to the worker task. These reads are uncoordinated with any concurrent writes, and no locks are acquired: in particular the worker may see partial updates from one or more other workers (e.g. a subset of the updates from another worker may have been applied, or a subset of the elements in a variable may have been updated).
2. The worker computes gradients locally, based on a batch of input data and the parameter values that it read in step 1.
3. The worker sends the gradients for each variable to the appropriate PS task, and applies the gradients to their respective variable, using an update rule that is determined by the optimization algorithm (e.g. SGD, SGD with Momentum, Adagrad, Adam, etc.). The update rules typically use (approximately) commutative operations, so they may be applied independently on the updates from each worker, and the state of each variable will be a running aggregate of the sequence of updates received.
I have to reproduce this kind of parameter server architecture in another environment and I need to deeply understand how workers and PS tasks interact with each other inside the TensorFlow framework.
My question is: does the PS task do some kind of merging or updating operation after receiving the values from the workers, or does it just store the newest value? Is just storing the newest value something reasonable? Looking at the code from the TensorFlow documentation, I see that the PS task just does a join(), and I wonder what the complete behaviour of the PS task behind this method call is.
One more question: what is the difference between computing a gradient and applying a gradient?
Let's go in reverse order and start from your last question: what is the difference between computing a gradient and applying a gradient?
Computing the gradients means running the backward pass on the network, after having computed the loss. For gradient descent, this means estimating the gradients value in the formula below (note: this is a huge simplification of what computing gradients actually entails; look up backpropagation and gradient descent for a proper explanation of how this works). Applying the gradients means updating the parameters according to the gradients you just computed. For gradient descent, this (roughly) means executing the following:
weights = weights - (learning_step * gradients)
Note that, depending on the value of learning_step, the new value of weights depends on both the previous value and the computed gradients.
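In TensorFlow terms, the two steps correspond to the optimizer's compute_gradients and apply_gradients methods, which minimize() simply chains together; a small sketch:

opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)

# Compute the gradients: run the backward pass, pairing each gradient
# with the variable it belongs to.
grads_and_vars = opt.compute_gradients(loss)

# Apply the gradients: update each variable, e.g. w <- w - lr * grad for plain SGD.
train_op = opt.apply_gradients(grads_and_vars)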
With this in mind, it's easier to understand the PS/worker architecture. Let's make the simplifying assumption that there is only one PS (we'll see later how to extend to multiple PSs).
A PS (parameter server) keeps in memory the weights (i.e. the parameters) and receives gradients, running the update step I wrote in the code above. It does this every time it receives gradients from a worker.
A worker, on the other hand, looks up the current value of the weights on the PS, makes a copy of it locally, runs a forward and a backward pass of the network on a batch of data, and gets new gradients, which it then sends back to the PS.
Note the emphasis on "current": there is no locking or inter-process synchronization between workers and the PS. If a worker reads weights in the middle of an update (and, for example, half already have the new value and half are still being updated), those are the weights it will use for the next iteration. This keeps things fast.
What if there are more PSs? No problem! The parameters of the network are partitioned among the PSs; the worker simply contacts all of them to get the new values for each chunk of the parameters, and sends back only the gradients relevant to each specific PS.
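For reference, this partitioning is what tf.train.replica_device_setter in the snippet above takes care of: with more than one PS task it spreads the variables across them (round-robin by default). A rough sketch, with made-up host addresses:

cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222", "ps1:2222"],   # two parameter servers
    "worker": ["worker0:2222"],
})

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    # Variables are placed on /job:ps/task:0 and /job:ps/task:1 in turn;
    # the ops that use them stay on the worker device.
    w1 = tf.get_variable("w1", shape=[100, 10])
    w2 = tf.get_variable("w2", shape=[10])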
Recently I've been toying with TensorFlow and I noticed that the framework is not able to use all my available computational resources. In the Convolutional Neural Networks tutorial they mention that
Naively employing asynchronous updates of model parameters leads to sub-optimal training performance because an individual model replica might be trained on a stale copy of the model parameters. Conversely, employing fully synchronous updates will be as slow as the slowest model replica.
Although they mention it both in the tutorial and in a whitepaper, I did not really find a way to do asynchronous parallel computation on a local machine. Is it even possible? Or is it part of the to-be-released distributed version of TensorFlow? If it is, then how?
Asynchronous gradient descent is supported in the open-source release of TensorFlow, without even modifying your graph. The easiest way to do it is to execute multiple concurrent steps in parallel:
import threading

loss = ...

# Any of the optimizer classes can be used here.
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

sess = tf.Session()
sess.run(tf.initialize_all_variables())

def train_function():
    # TODO: Better termination condition, e.g. using a `max_steps` counter.
    while True:
        sess.run(train_op)

# Create multiple threads to run `train_function()` in parallel.
train_threads = []
for _ in range(NUM_CONCURRENT_STEPS):
    train_threads.append(threading.Thread(target=train_function))

# Start the threads, and block on their completion.
for t in train_threads:
    t.start()
for t in train_threads:
    t.join()
This example sets up NUM_CONCURRENT_STEPS calls to sess.run(train_op). Since there is no coordination between these threads, they proceed asynchronously.
It's actually more challenging to achieve synchronous parallel training (at present), because this requires additional coordination to ensure that all replicas read the same version of the parameters, and that all of their updates become visible at the same time. The multi-GPU example for CIFAR-10 training performs synchronous updates by making multiple copies of the "tower" in the training graph with shared parameters, and explicitly averaging the gradients across the towers before applying the update.
N.B. The code in this answer places all computation on the same device, which will not be optimal if you have multiple GPUs in your machine. If you want to use all of your GPUs, follow the example of the multi-GPU CIFAR-10 model, and create multiple "towers" with their operations pinned to each GPU. The code would look roughly as follows:
train_ops = []
for i in range(NUM_GPUS):
    with tf.device("/gpu:%d" % i):
        # Define a tower on GPU `i`.
        loss = ...
        train_ops.append(tf.train.GradientDescentOptimizer(0.01).minimize(loss))

def train_function(train_op):
    # TODO: Better termination condition, e.g. using a `max_steps` counter.
    while True:
        sess.run(train_op)

# Create multiple threads to run `train_function()` in parallel.
train_threads = []
for train_op in train_ops:
    train_threads.append(threading.Thread(target=train_function, args=(train_op,)))

# Start the threads, and block on their completion.
for t in train_threads:
    t.start()
for t in train_threads:
    t.join()
Note that you might find it convenient to use a "variable scope" to facilitate variable sharing between the towers.
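A minimal sketch of that sharing pattern, assuming a hypothetical build_tower() that creates the tower's variables via tf.get_variable:

train_ops = []
for i in range(NUM_GPUS):
    with tf.device("/gpu:%d" % i):
        # Reuse the variables created by the first tower in all later towers.
        with tf.variable_scope("model", reuse=(i > 0)):
            loss = build_tower()  # hypothetical tower-building function
            train_ops.append(tf.train.GradientDescentOptimizer(0.01).minimize(loss))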