Using model.predict (Keras + TF) in multiprocessing - python

I have the following problem. I'm using a TensorFlow Keras model to evaluate continuous sensor data. The input to my model consists of 15 sensor data frames. Because model.predict() takes nearly 1 second, I wanted to execute this function asynchronously so that I can collect the next data frames during that time.
To accomplish this I created a Pool with the multiprocessing library and a function for model.predict. My code looks something like this:
from multiprocessing import Pool
import tensorflow as tf

def predictData(data):
    return model.predict(data)

global model
model = tf.keras.models.load_model("Network.h5")
model._make_predict_function()
p = Pool(processes=4)
...
res = p.apply_async(predictData, ([[iinput]],))
print(res.get(timeout=10))
Now I always get a timeout error when calling predictData(). It seems like model.predict() is not working correctly. What am I doing wrong?

The reason is that each process you spawn requires its own initialized copy of the model to make predictions. Therefore you have to make sure you instantiate/load your model in every spawned process. This is definitely not optimal.
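For example, one way to do that per-process loading cleanly is a Pool initializer, so each worker loads the model exactly once instead of once per task (a rough sketch reusing Network.h5 from the question; some_input stands in for your real sensor frames):

import numpy as np
from multiprocessing import Pool

def init_worker():
    # Runs once in each worker process: load the model there, not in the parent.
    global model
    import tensorflow as tf
    model = tf.keras.models.load_model("Network.h5")

def predictData(data):
    return model.predict(data)

if __name__ == "__main__":
    p = Pool(processes=4, initializer=init_worker)
    some_input = np.zeros((1, 15, 8))  # placeholder for your prepared frames
    res = p.apply_async(predictData, (some_input,))
    print(res.get(timeout=30))

Note that the first prediction in each worker will still be slow because the model has to be loaded and initialized there.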
This is a known caveat of using multiprocessing for machine learning training and/or inference. Some libraries ship with multiprocessing support out of the box and provide parallelizable calls to their models. However, with most libraries, once you want to do multiprocessing you are on your own!
Make sure you instantiate your model once and then find a way to share that model across processes. One basic way to do that is to serve your model as a Flask service and then make predictions against that service to your heart's content. Cheers!
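A minimal sketch of such a service, again assuming the Network.h5 file from the question; the /predict route and the JSON payload shape are illustrative choices, not a fixed API:

import numpy as np
import tensorflow as tf
from flask import Flask, request, jsonify

app = Flask(__name__)
model = tf.keras.models.load_model("Network.h5")  # loaded once, at startup

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"frames": [[...], [...], ...]} with 15 frames.
    frames = np.array(request.get_json()["frames"])
    prediction = model.predict(frames[np.newaxis, ...])  # add batch dimension
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Your data-collection loop can then POST frames to this service and keep collecting while the request is in flight.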

It is possible to run multiple predictions in multiple concurrent Python processes; you just have to build its own TensorFlow computational graph inside each independent process and then call keras.model.predict.
Write a function that you will use with the multiprocessing module (with the Process or Pool class). Within this function you should build your model, your TensorFlow graph, and whatever else you need; set all TensorFlow and Keras variables; then call the predict method; and finally pipe the result back to your master process.
For example:
from multiprocessing import Pool

def f(data):
    import tensorflow, keras
    # Configure your TensorFlow and Keras settings here (e.g. GPU/CPU usage).
    keras_model = build_your_keras_model()
    result = keras_model.predict(data)
    return result

if __name__ == '__main__':
    p = Pool(processes=4)
    res = p.apply_async(f, (data,))
    print(res.get(timeout=10))

Related

Tensorflow - How to split batches between GPUs for predicting on trained models?

I'm using models I didn't create but have modified (from this repo: https://github.com/GeorgeSeif/Semantic-Segmentation-Suite).
I have trained models and can use them to predict well enough, but I want to run entire folders of images through and split the work between multiple GPUs. I don't fully understand how tf.device() works, and what I have tried didn't work at all.
I assumed I could do something like this:
for i, d in enumerate(['/gpu:0', '/gpu:1']):
    with tf.device(d):
        output = sess.run(network, feed_dict={net_input: image_batch[i]})
But this doesn't actually allocate the tasks to the different GPUs; it doesn't raise an error either.
My question is: is it possible to allocate different images to different instances of the session on separate GPUs without explicitly modifying the network code before training? I would like to avoid running two different Python scripts with CUDA_VISIBLE_DEVICES = ...
Is there a simple way to do this?
From what I understand, the definitions of the operations have to be nested in a "with tf.device()" block. However, when inferencing, the only operation is loading the model and weights, and if I put that in a "with tf.device()" block I get an error saying the graph already exists and cannot be defined twice.
tf.device only applies when building the graph, not executing it, so wrapping session.run calls in a device context does nothing.
Instead, I recommend you use TF Replicator or a TF distribution strategy (tf.distribute / tf.contrib.distribute depending on the TF version), specifically MirroredStrategy.
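For illustration, a minimal TF 2.x sketch using MirroredStrategy; the model path is a placeholder and image_batch stands in for a batch of your folder's images:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
with strategy.scope():
    model = tf.keras.models.load_model("my_segmentation_model.h5")  # placeholder path

# Keras splits the batches across the GPU replicas for models built under the scope.
predictions = model.predict(image_batch, batch_size=16)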

How to run same Keras model on different GPUs in parallel independently with different data?

Let's say I have two instances of a Keras model, model0 and model1, and datasets data0 and data1. If I have two or more GPUs, is there a way that I can train model0 on data0 on GPU0 and model1 on data1 on GPU1 in parallel? All of the methods I have found so far split the training of a single model over multiple GPUs.
Thanks!
How about multiprocessing?
You just execute your function in a multiprocessing pool twice; a sketch is shown after the list below.
What you need to consider:
Your model has to be defined or loaded inside the function
You need a parameter that masks the GPUs. Masking is possible by setting the environment variable CUDA_VISIBLE_DEVICES (you also have to do this inside the function)
You could pass the different training data via a parameter
It would be best to save the resulting models into different files and then load them from your main program
So basically, passing Keras/TensorFlow sessions between your main program and the functions in the multiprocessing pool is a no-go. But if you keep everything Keras/TensorFlow related inside the function and mask the GPUs differently, then you're good to go.
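A rough sketch of that pattern (build_model and load_data are hypothetical placeholders for your own model definition and data loading; the file names are only examples):

import multiprocessing as mp

def train_on_gpu(gpu_id, data_path, out_path):
    # Mask the GPUs inside the worker, before TensorFlow is imported/initialized.
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    import tensorflow as tf

    model = build_model()        # hypothetical: define or load the model here
    X, Y = load_data(data_path)  # hypothetical data loader
    model.fit(X, Y, epochs=10)
    model.save(out_path)         # each worker writes its own file

if __name__ == "__main__":
    jobs = [(0, "data0.npz", "model0.h5"),
            (1, "data1.npz", "model1.h5")]
    with mp.Pool(processes=2) as pool:
        pool.starmap(train_on_gpu, jobs)

The main process only sees file paths and return values, never TensorFlow objects, which is exactly the constraint described above.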

Using Keras for real-time training and predicting

I want to use Keras in a real-time training and prediction setting. In my scenario I get real-time data via MQTT that should be used to train an (LSTM) neural network and/or to apply the network to get a prediction.
I am using the TensorFlow backend with GPU support and a fairly potent GPU, but in my scenario Keras does not really profit from GPU acceleration. (I did some performance tests using the examples in the Keras repository to make sure that GPU acceleration works in general.) In my first approach, I used the model.train_on_batch(...) method to train the network with each item coming in via MQTT:
model = load_model()

def on_message(msg):
    """
    Method called by MQTT client each time new data comes in
    """
    if msg.topic == 'my/topic':
        X, Y = prepare_data(msg.payload)
        prediction = model.predict(X)
        loss = model.train_on_batch(X, Y)
        send_to_visualization_tool(prediction, loss)
One training step in this setting takes about 200 ms. However, when I introduce a buffer, e.g. buffering 100 data points, the training time for the whole batch only increases slightly. This suggests that the setup time for batch training has a huge overhead. I also noticed that when using batches of size 1, CPU consumption is quite high while the GPU is hardly used at all.
As an alternative, I now introduced a synchronized queue: the MQTT client pushes data into it whenever data comes in, and the neural network then consumes, as a single batch, all the data that arrived while the previous batch was being processed:
train_data_queue = Queue.Queue()

# MQTT client running in separate thread
def on_message(msg):
    train_data_queue.put(msg.payload)

model = load_model()

while True:
    # Dequeue all items from the queue, or block until at least one item is present.
    train_data_batch = dequeue_all(train_data_queue)
    X, Y = prepare_data(train_data_batch)
    predictions = model.predict_on_batch(X)
    losses = model.train_on_batch(X, Y)
    send_to_visualization_tool(predictions, losses)
This approach works okay, but it would be nice if I could get rid of the additional complexity of synchronized queues and multithreading, i.e. get the first approach to work.
My question therefore is: is there a way to reduce the overhead of single-batch training steps, e.g. by reimplementing the model in pure TensorFlow?
Or can you think of a better way to do real-time training with Keras?
The performance of Keras should be broadly similar to the performance of raw TensorFlow, so I do not recommend rewriting your model.
Indeed, modern hardware usually takes about the same time to train on a single example as it does on a batch of examples, which is why we spend so much effort batching things up. You can get rid of the complexity of synchronized queues if you use tf.contrib.batching.batch_function, but you'll still need to feed it from many threads if you want the extra throughput.

Multiple sessions and graphs in Tensorflow (in the same process)

I'm training a model where the input vector is the output of another model. This involves restoring the first model from a checkpoint file while initializing the second model from scratch (using tf.initialize_variables()) in the same process.
There is a substantial amount of code and abstraction, so I'm just pasting the relevant sections here.
The following is the restoring code:
self.variables = [var for var in all_vars if var.name.startswith(self.name)]
saver = tf.train.Saver(self.variables, max_to_keep=3)
self.save_path = tf.train.latest_checkpoint(os.path.dirname(self.checkpoint_path))

if should_restore:
    self.saver.restore(self.sess, save_path)
else:
    self.sess.run(tf.initialize_variables(self.variables))
Each model is scoped within its own graph and session, like this:
self.graph = tf.Graph()
self.sess = tf.Session(graph=self.graph)

with self.sess.graph.as_default():
    # Create variables and ops.
All the variables within each model are created within the variable_scope context manager.
The feeding works as follows:
A background thread calls sess.run(inference_op) on input = scipy.misc.imread(X) and puts the result in a blocking thread-safe queue.
The main training loop reads from the queue and calls sess.run(train_op) on the second model.
PROBLEM:
I am observing that the loss values, even in the very first iteration of training (of the second model), keep changing drastically across runs (and become NaN in a few iterations). I confirmed that the output of the first model is exactly the same every time. Commenting out the sess.run of the first model and replacing it with identical input from a pickled file does not show this behaviour.
This is the train_op:
loss_op = tf.nn.sparse_softmax_cross_entropy(network.feedforward())

# Apply gradients.
with tf.control_dependencies([loss_op]):
    opt = tf.train.GradientDescentOptimizer(lr)
    grads = opt.compute_gradients(loss_op)
    apply_gradient_op = opt.apply_gradients(grads)
return apply_gradient_op
I know this is vague, but I'm happy to provide more details. Any help is appreciated!
The issue is most certainly happening due to concurrent execution of different session objects. I moved the first model's session from the background thread to the main thread, repeated the controlled experiment several times (running for over 24 hours and reaching convergence) and never observed NaN. On the other hand, concurrent execution diverges the model within a few minutes.
I've restructured my code to use a common session object for all models.
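For illustration, a rough sketch of that restructured setup using the same pre-1.0 style APIs as the question (build_model_a and build_model_b are hypothetical stand-ins for the two model-building routines, and the checkpoint path is a placeholder):

import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    with tf.variable_scope("model_a"):
        inference_op = build_model_a()  # hypothetical: the restored, pretrained model
    with tf.variable_scope("model_b"):
        train_op = build_model_b()      # hypothetical: the model trained from scratch

sess = tf.Session(graph=graph)
with graph.as_default():
    all_vars = tf.all_variables()
    vars_a = [v for v in all_vars if v.name.startswith("model_a")]
    vars_b = [v for v in all_vars if v.name.startswith("model_b")]
    tf.train.Saver(vars_a).restore(sess, "model_a.ckpt")  # placeholder checkpoint path
    sess.run(tf.initialize_variables(vars_b))

# Run both models through this one session, from the main thread only:
#   features = sess.run(inference_op, feed_dict=...)
#   sess.run(train_op, feed_dict=...)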

Asynchronous computation in TensorFlow

Recently I've been toying with TensorFlow and I noticed that the framework is not able to use all my available computational resources. In the Convolutional Neural Networks tutorial they mention that
Naively employing asynchronous updates of model parameters leads to sub-optimal training performance because an individual model replica might be trained on a stale copy of the model parameters. Conversely, employing fully synchronous updates will be as slow as the slowest model replica.
Although they mention it both in the tutorial and in a whitepaper, I did not really find a way to do asynchronous parallel computation on a local machine. Is it even possible? Or is it part of the to-be-released distributed version of TensorFlow? If it is, then how?
Asynchronous gradient descent is supported in the open-source release of TensorFlow, without even modifying your graph. The easiest way to do it is to execute multiple concurrent steps in parallel:
import threading
import tensorflow as tf

loss = ...

# Any of the optimizer classes can be used here.
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

sess = tf.Session()
sess.run(tf.initialize_all_variables())

def train_function():
    # TODO: Better termination condition, e.g. using a `max_steps` counter.
    while True:
        sess.run(train_op)

# Create multiple threads to run `train_function()` in parallel.
train_threads = []
for _ in range(NUM_CONCURRENT_STEPS):
    train_threads.append(threading.Thread(target=train_function))

# Start the threads, and block on their completion.
for t in train_threads:
    t.start()
for t in train_threads:
    t.join()
This example sets up NUM_CONCURRENT_STEPS threads, each repeatedly calling sess.run(train_op). Since there is no coordination between these threads, they proceed asynchronously.
It's actually more challenging to achieve synchronous parallel training (at present), because this requires additional coordination to ensure that all replicas read the same version of the parameters, and that all of their updates become visible at the same time. The multi-GPU example for CIFAR-10 training performs synchronous updates by making multiple copies of the "tower" in the training graph with shared parameters, and explicitly averaging the gradients across the towers before applying the update.
N.B. The code in this answer places all computation on the same device, which will not be optimal if you have multiple GPUs in your machine. If you want to use all of your GPUs, follow the example of the multi-GPU CIFAR-10 model, and create multiple "towers" with their operations pinned to each GPU. The code would look roughly as follows:
train_ops = []
for i in range(NUM_GPUS):
    with tf.device("/gpu:%d" % i):
        # Define a tower on GPU `i`.
        loss = ...
        train_ops.append(tf.train.GradientDescentOptimizer(0.01).minimize(loss))

def train_function(train_op):
    # TODO: Better termination condition, e.g. using a `max_steps` counter.
    while True:
        sess.run(train_op)

# Create multiple threads to run `train_function()` in parallel.
train_threads = []
for train_op in train_ops:
    train_threads.append(threading.Thread(target=train_function, args=(train_op,)))

# Start the threads, and block on their completion.
for t in train_threads:
    t.start()
for t in train_threads:
    t.join()
Note that you might find it convenient to use a "variable scope" to facilitate variable sharing between the towers.
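For illustration, a rough sketch of that variable-sharing pattern in the style of the CIFAR-10 multi-GPU example (build_tower and the per-GPU input split inputs[i] are hypothetical placeholders; the towers must create their weights with tf.get_variable for reuse to work):

tower_losses = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device("/gpu:%d" % i):
            # Build the model ops for this GPU's slice of the input.
            tower_losses.append(build_tower(inputs[i]))
            # Make the towers on the remaining GPUs reuse the same variables.
            tf.get_variable_scope().reuse_variables()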
