Recently I've been toying with TensorFlow and I noticed that the framework is not able to use all my available computational resources. In the Convolutional Neural Networks tutorial they mention that:
Naively employing asynchronous updates of model parameters leads to sub-optimal training performance because an individual model replica might be trained on a stale copy of the model parameters. Conversely, employing fully synchronous updates will be as slow as the slowest model replica.
Although they mention it both in the tutorial and in a whitepaper, I did not really find a way to do the asynchronous parallel computation on a local machine. Is it even possible? Or is it part of the distributed, to-be-released version of TensorFlow? If it is, then how?
Asynchronous gradient descent is supported in the open-source release of TensorFlow, without even modifying your graph. The easiest way to do it is to execute multiple concurrent steps in parallel:
import threading
import tensorflow as tf
loss = ...
# Any of the optimizer classes can be used here.
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
sess = tf.Session()
sess.run(tf.initialize_all_variables())
def train_function():
    # TODO: Better termination condition, e.g. using a `max_steps` counter.
    while True:
        sess.run(train_op)
# Create multiple threads to run `train_function()` in parallel.
train_threads = []
for _ in range(NUM_CONCURRENT_STEPS):
    train_threads.append(threading.Thread(target=train_function))
# Start the threads, and block on their completion.
for t in train_threads:
    t.start()
for t in train_threads:
    t.join()
This example sets up NUM_CONCURRENT_STEPS calls to sess.run(train_op). Since there is no coordination between these threads, they proceed asynchronously.
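The TODO in the code above asks for a better termination condition. One way to get it, as a minimal sketch, is a shared step counter guarded by a lock, so that the threads together run exactly max_steps steps. Here run_step is a hypothetical stand-in for sess.run(train_op), and StepCounter, run_training and the constant values are illustrative names that do not come from TensorFlow.

```python
import threading

MAX_STEPS = 1000            # illustrative total number of training steps
NUM_CONCURRENT_STEPS = 4    # illustrative number of threads


class StepCounter:
    """Thread-safe counter that hands out at most `max_steps` step indices."""

    def __init__(self, max_steps):
        self._max_steps = max_steps
        self._next_step = 0
        self._lock = threading.Lock()

    def claim(self):
        """Return the next step index, or None once the budget is used up."""
        with self._lock:
            if self._next_step >= self._max_steps:
                return None
            step = self._next_step
            self._next_step += 1
            return step


def run_training(run_step, max_steps=MAX_STEPS, num_threads=NUM_CONCURRENT_STEPS):
    """Call `run_step()` exactly `max_steps` times across `num_threads` threads."""
    counter = StepCounter(max_steps)

    def train_function():
        # Each thread keeps claiming steps until the shared budget is exhausted.
        while counter.claim() is not None:
            run_step()  # hypothetical stand-in for sess.run(train_op)

    threads = [threading.Thread(target=train_function) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Only the counter is locked; the training steps themselves still run without any coordination, so the updates stay asynchronous.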
It's actually more challenging to achieve synchronous parallel training (at present), because this requires additional coordination to ensure that all replicas read the same version of the parameters, and that all of their updates become visible at the same time. The multi-GPU example for CIFAR-10 training performs synchronous updates by making multiple copies of the "tower" in the training graph with shared parameters, and explicitly averaging the gradients across the towers before applying the update.
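To make the synchronous scheme concrete, here is a framework-free sketch of one update: every tower computes gradients against the same parameter values, the gradients are averaged, and a single update is applied. The function names and the toy one-weight model are assumptions for illustration only; this is not the code of the CIFAR-10 example.

```python
def average_gradients(tower_grads):
    """Average a list of per-tower gradient vectors element-wise."""
    num_towers = len(tower_grads)
    return [sum(g) / num_towers for g in zip(*tower_grads)]


def synchronous_step(weights, tower_batches, grad_fn, learning_rate):
    """One synchronous update: every tower sees the same `weights`."""
    # All towers compute gradients against the same parameter version.
    tower_grads = [grad_fn(weights, batch) for batch in tower_batches]
    avg = average_gradients(tower_grads)
    # A single update makes all contributions visible at the same time.
    return [w - learning_rate * g for w, g in zip(weights, avg)]


def squared_error_grad(weights, batch):
    """Gradient of mean (w * x - y) ** 2 for a toy one-weight model."""
    (w,) = weights
    return [sum(2 * (w * x - y) * x for x, y in batch) / len(batch)]
```

For example, synchronous_step([0.0], [[(1.0, 2.0)], [(2.0, 4.0)]], squared_error_grad, 0.1) averages the two tower gradients (-4 and -16) and moves the weight from 0.0 to roughly 1.0.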
N.B. The code in this answer places all computation on the same device, which will not be optimal if you have multiple GPUs in your machine. If you want to use all of your GPUs, follow the example of the multi-GPU CIFAR-10 model, and create multiple "towers" with their operations pinned to each GPU. The code would look roughly as follows:
train_ops = []
for i in range(NUM_GPUS):
    with tf.device("/gpu:%d" % i):
        # Define a tower on GPU `i`.
        loss = ...
        train_ops.append(tf.train.GradientDescentOptimizer(0.01).minimize(loss))
def train_function(train_op):
    # TODO: Better termination condition, e.g. using a `max_steps` counter.
    while True:
        sess.run(train_op)
# Create multiple threads to run `train_function()` in parallel.
train_threads = []
for train_op in train_ops:
    train_threads.append(threading.Thread(target=train_function, args=(train_op,)))
# Start the threads, and block on their completion.
for t in train_threads:
    t.start()
for t in train_threads:
    t.join()
Note that you might find it convenient to use a "variable scope" to facilitate variable sharing between the towers.
Related
The problem
I am currently working on a project that I sadly can't share with you. The project is about hyper-parameter optimization for neural networks, and it requires that I train multiple neural network models (more than I can store on my GPU) in parallel. The network architectures stay the same, but the network parameters and hyper-parameters are subject to change between training intervals. I am currently achieving this using PyTorch in a Linux environment, which allows my NVIDIA GTX 1660 (6GB RAM) to use the multiprocessing feature that PyTorch provides.
Code (simplified):
def training_function(checkpoint):
    load(checkpoint)
    train(checkpoint)
    unload(checkpoint)
    return checkpoint  # assumed: hand the trained checkpoint back to the pool
for step in range(training_steps):
    trained_checkpoints = list()
    for trained_checkpoint in pool.imap_unordered(training_function, checkpoints):
        trained_checkpoints.append(trained_checkpoint)
    for optimized_checkpoint in optimize(trained_checkpoints):
        checkpoints.update(optimized_checkpoint)
I currently test with a population of 30 neural networks (i.e. 30 checkpoints) on the MNIST and FashionMNIST datasets, each of which consists of 70,000 (50k training, 10k validation, 10k testing) 28x28 single-channel images. The network I train is a simple LeNet-5 implementation.
I use a torch.multiprocessing pool and allow 7 processes to be spawned. Each process uses some of the GPU memory available just to initialize the CUDA environment in each process. After training, the checkpoints are adapted with my hyper-parameter optimization technique.
The load function in the training_function loads the model and optimizer state (which holds the network parameter tensors) from a local file into GPU memory using torch.load. The unload function saves the newly trained states back to file using torch.save and deletes them from memory. I do this because PyTorch only frees GPU tensors once no variable references them, and I have limited GPU memory.
The current setup works, but each CUDA initialization occupies over 700MB of GPU RAM, so I am interested in whether there are other ways to do this that use less memory without a penalty to efficiency.
My attempts
I suspected I could use a thread pool in order to save some memory, and it did. By spawning 7 threads instead of 7 processes, CUDA was only initialized once, which saved almost half of my memory. However, this led to a new problem: GPU utilization dropped to approx. 30% according to nvidia-smi, which I monitor in a separate Linux terminal. Without threads, I get around 85-90% utilization.
I also messed around with torch.multiprocessing.set_sharing_strategy which is currently set to 'file_descriptor', but with no luck.
My questions
Is there a better way to work with multiple model- and optimizer states without saving and loading them to files while training? I have tried to move the model to CPU using model.cpu() before saving the state_dict, but this did not work in my implementation (memory leaks).
Is there an efficient way I can train multiple neural networks at the same time that uses less GPU memory? When searching the web, I only find references to nn.DataParallel which trains the same model over multiple GPUs by copying it to each GPU. This does not work for my problem.
I will soon have access to multiple, more powerful GPUs with more memory, and I suspect this problem will be less annoying then, but I wouldn't be surprised if there is a better solution I am not getting.
Update (09.03.2020)
For any future readers: if you set out to do something similar to the pseudocode displayed above, and you plan on using multiple GPUs, please make sure to create one multiprocessing pool for each GPU device. A pool does not hand tasks to its underlying processes in any fixed order, so with a single pool you will end up initializing CUDA for several devices in the same process, wasting memory.
Another important note is that while you may be passing the device (e.g. 'cuda:1') to every torch.cuda function, you may discover that torch does something with the default CUDA device 'cuda:0' somewhere in the code, initializing CUDA on that device for every process, which wastes memory on an unwanted and unneeded CUDA initialization. I fixed this issue by wrapping the entire body of training_function in a torch.cuda.device(device_id) context manager (a with block).
I ended up not using multiprocessing pools and instead defined my own custom process class that holds the device and training function. This means I have to maintain an in-queue for each device process, but they all share the same out-queue, meaning I can retrieve the results the moment they are available. I figured writing a custom process class was simpler than writing a custom pool class. I desperately tried to keep using pools as they are easily maintained, but I had to use multiple imap functions, so the results were not obtainable one at a time, which led to a less efficient training loop.
I am now successfully training on multiple GPUs, but my questions posted above still remain unanswered.
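The routing described above (one in-queue per device, one shared out-queue) can be sketched roughly as follows. To keep the sketch self-contained it uses threads and queue.Queue as stand-ins; the setup described here uses processes, where multiprocessing.Process and multiprocessing.Queue would take their place. DeviceWorker, process_all and the device strings are hypothetical names, not my actual classes.

```python
import queue
import threading


class DeviceWorker(threading.Thread):
    """Consumes tasks for one device and reports results on a shared queue."""

    def __init__(self, device, out_queue):
        super().__init__(daemon=True)
        self.device = device
        self.in_queue = queue.Queue()   # private to this device
        self.out_queue = out_queue      # shared by all workers

    def run(self):
        while True:
            task = self.in_queue.get()
            if task is None:            # None is the shutdown signal
                break
            function, checkpoint = task
            # Results are published as soon as they are ready.
            self.out_queue.put(function(checkpoint, self.device))


def process_all(function, checkpoints, devices):
    """Spread `checkpoints` round-robin over `devices`; yield results as they finish."""
    out_queue = queue.Queue()
    workers = [DeviceWorker(device, out_queue) for device in devices]
    for worker in workers:
        worker.start()
    for i, checkpoint in enumerate(checkpoints):
        workers[i % len(workers)].in_queue.put((function, checkpoint))
    for _ in checkpoints:
        yield out_queue.get()
    for worker in workers:
        worker.in_queue.put(None)
    for worker in workers:
        worker.join()
```

Because every worker writes to the same out-queue, the caller receives results in completion order rather than submission order. The sketch does no error handling: if function raises, that worker stops and the caller waits forever.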
Update (10.03.2020)
I have implemented another way to store model and optimizer state dicts outside of GPU RAM. I have written a function that replaces every tensor in the dicts with its .to('cpu') equivalent. This costs me some CPU memory, but it is more reliable than storing local files.
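A function of that kind could look roughly like the sketch below. It is duck-typed: anything with a .to method (such as a torch.Tensor) is moved, containers are rebuilt recursively, and everything else is returned unchanged. The name move_to_device is hypothetical, and the sketch deliberately does not import torch.

```python
def move_to_device(obj, device):
    """Return a copy of a nested state structure with every tensor on `device`."""
    if hasattr(obj, "to") and callable(obj.to):
        return obj.to(device)          # e.g. torch.Tensor.to('cpu')
    if isinstance(obj, dict):
        return {key: move_to_device(value, device) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(value, device) for value in obj)
    return obj                         # ints, floats, strings, None, ...
```

Assuming model and optimizer are the usual PyTorch objects, usage would be along the lines of move_to_device(optimizer.state_dict(), 'cpu') before dropping the GPU copies, and move_to_device(saved_state, 'cuda:0') before loading them again.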
Update (11.06.2020)
I have still not found a different approach that leads to fewer CUDA initializations while maintaining the same processing speed. From what I've read and come to understand, PyTorch does not interfere much with how CUDA operates, and leaves that up to NVIDIA.
I have ended up using a pool of custom, device-specific processes, called Workers, that is maintained by my custom pool class (more about this above). In addition, I let each of these Workers take in one or more checkpoints as well as the function that processes them (training, testing, hp optimization) via a Queue. These checkpoints are then processed simultaneously via a python multiprocessing ThreadPool in each Worker and the results are then returned one by one via the return Queue the moment they are ready.
This gives me the parallel procedure I was needing, but the memory issue is still there. Due to time constraints, I have come to terms with it for now.
I have a system in which 3-4 separate processes continuously query a model for predictions.
This is reinforcement learning for a video game, so I cannot use workers/queues of data.
I then want to send the actions/rewards to a central process for learning. After it updates the weights, all the other processes will need the updated weights too.
I have looked at
https://www.tensorflow.org/deploy/distributed
https://clusterone.com/blog/2017/09/13/distributed-tensorflow-clusterone/
Most examples are doing the opposite where the training is on the distributed machines.
How can I set up the task workers so that the task they run is just a prediction step instead of a train step?
train_step = (
    tf.train.AdamOptimizer(learning_rate)
    .minimize(loss, global_step=global_step)
)
This will not work in my case unless I can grab data outside of it.
Also, each process is created outside of my control, so TensorFlow cannot create the processes.
It is similar to this question:
How to run several Keras neural networks in parallel
But that question has no answers, and it is based on Theano whereas mine is on TensorFlow.
Also similar to this:
Running Keras model for prediction in multiple threads
But mine runs in separate processes, not threads.
I've read the Distributed TensorFlow Doc and this question on StackOverflow, but I still have some doubts about the dynamics behind the distributed training that can be done with TensorFlow and its Parameter Server Architecture.
This is a snippet of code from the Distributed TensorFlow Doc:
if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":
    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        # Build model...
        loss = ...
        global_step = tf.contrib.framework.get_or_create_global_step()
        train_op = tf.train.AdagradOptimizer(0.01).minimize(
            loss, global_step=global_step)
And here is part of the answer to the StackOverflow question that I read:
1. The worker reads all of the shared model parameters in parallel from the PS task(s), and copies them to the worker task. These reads are uncoordinated with any concurrent writes, and no locks are acquired: in particular the worker may see partial updates from one or more other workers (e.g. a subset of the updates from another worker may have been applied, or a subset of the elements in a variable may have been updated).
2. The worker computes gradients locally, based on a batch of input data and the parameter values that it read in step 1.
3. The worker sends the gradients for each variable to the appropriate PS task, and applies the gradients to their respective variable, using an update rule that is determined by the optimization algorithm (e.g. SGD, SGD with Momentum, Adagrad, Adam, etc.). The update rules typically use (approximately) commutative operations, so they may be applied independently on the updates from each worker, and the state of each variable will be a running aggregate of the sequence of updates received.
I have to reproduce this kind of parameter server architecture in another environment and I need to deeply understand how workers and PS tasks interact with each other inside the TensorFlow framework.
My question is: does the PS task do some kind of merging or updating operation after receiving the values from the workers, or does it just store the newest value? Can just storing the newest value be reasonable? Looking at the code from the TensorFlow documentation, I see that the PS task just calls join(), and I wonder what the complete behaviour of the PS task is behind this method call.
One more question: what is the difference between computing a gradient and applying a gradient?
Let's go in reverse order and start from your last question: what is the difference between computing a gradient and applying a gradient?
Computing the gradients means running the backward pass on the network, after having computed the loss. For gradient descent, this means estimating the gradients value in the formula beneath (note: this is a huge simplification of what computing gradients actually entails; look up backpropagation and gradient descent for a proper explanation of how this works). Applying the gradients means updating the parameters according to the gradients you just computed. For gradient descent, this (roughly) means executing the following:
weights = weights - (learning_step * gradients)
Note that, depending on the value of learning_step, the new value of weights depends on both the previous value and the computed gradients.
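As a small worked example of the two steps, consider a toy one-weight model with loss (weight * x - y) ** 2, whose gradient with respect to the weight is 2 * (weight * x - y) * x. The functions compute_gradient and apply_gradient and the numbers below are purely illustrative; in a real framework the compute step is done by backpropagation.

```python
def compute_gradient(weight, x, y):
    """Backward pass for the toy loss (weight * x - y) ** 2."""
    return 2.0 * (weight * x - y) * x


def apply_gradient(weight, gradient, learning_step):
    """Update step: move the weight against the gradient."""
    return weight - learning_step * gradient


weight = 0.0
for _ in range(50):
    gradient = compute_gradient(weight, x=2.0, y=6.0)                # compute
    weight = apply_gradient(weight, gradient, learning_step=0.05)   # apply
```

With x = 2 and y = 6 the loss is minimized at weight = 3, and the loop above converges towards that value.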
With this in mind, it's easier to understand the PS/worker architecture. Let's make the simplifying assumption that there is only one PS (we'll see later how to extend to multi-PS).
A PS (parameter server) keeps the weights (i.e. the parameters) in memory and receives gradients, running the update step I wrote in the code above. It does this every time it receives gradients from a worker.
A worker, on the other hand, looks up the current value of the weights in the PS, makes a local copy of it, runs a forward and a backward pass of the network on a batch of data and gets new gradients, which it then sends back to the PS.
Note the emphasis on "current": there is no locking or inter-process synchronization between workers and the PS. If a worker reads the weights in the middle of an update (and, for example, half already have the new value and half are still being updated), those are the weights it will use for the next iteration. This keeps things fast.
What if there are more PSs? No problem! The parameters of the network are partitioned among the PSs; the worker simply contacts all of them to get the new values for each chunk of the parameters and sends back only the gradients relevant to each specific PS.
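The interaction described above can be sketched in plain Python, with threads standing in for worker tasks and objects standing in for PS tasks. This is a toy illustration under those assumptions, not TensorFlow's implementation: ParameterServer, worker, train and the quadratic toy loss are all made up for the example.

```python
import threading


class ParameterServer:
    """Holds one shard of the weights and applies every gradient it receives."""

    def __init__(self, weights, learning_step):
        self.weights = list(weights)
        self.learning_step = learning_step

    def pull(self):
        # No lock: a reader may observe a partially updated shard.
        return list(self.weights)

    def push(self, gradients):
        # The update rule from the text, applied on arrival.
        for i, g in enumerate(gradients):
            self.weights[i] -= self.learning_step * g


def worker(servers, targets, num_steps):
    """Pull all shards, compute gradients locally, push each slice to its shard."""
    for _ in range(num_steps):
        shards = [ps.pull() for ps in servers]
        for ps, shard, target in zip(servers, shards, targets):
            # Gradient of sum((w - t) ** 2) with respect to this shard only.
            ps.push([2.0 * (w - t) for w, t in zip(shard, target)])


def train(num_workers=4, num_steps=200):
    # Two PS shards partition a 3-element weight vector (illustrative sizes).
    servers = [ParameterServer([0.0, 0.0], 0.05), ParameterServer([0.0], 0.05)]
    targets = [[1.0, -2.0], [3.0]]  # toy optimum, split the same way as the weights
    threads = [threading.Thread(target=worker, args=(servers, targets, num_steps))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [w for ps in servers for w in ps.weights]
```

The PS never just overwrites its weights with a value from a worker: it folds each incoming gradient into the current weights. Workers may compute gradients from slightly stale weights, yet on this toy problem train() still ends up close to the optimum [1.0, -2.0, 3.0].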
I'm trying to train a (pretty big) neural network using a GPU. The network is written in PyTorch. I use Python 3.6.3 running on Ubuntu 16.04. Currently, the code is running, but it's taking about twice as long as it should because my data-grabbing process, which uses the CPU, runs in series with the training process, which uses the GPU. Essentially, I grab a mini-batch from file using a mini-batch generator, send that mini-batch to the GPU and then train the network on that mini-batch. I've timed the two processes (grabbing a mini-batch and training on that mini-batch), and they take similar amounts of time (both around 200ms). I'd like to do something similar to Keras' fit_generator method, which runs the data-grabbing in parallel with the training (it creates a queue of mini-batches that can be sent to the GPU when the GPU wants to train on that mini-batch). What is the best way to do that? For concreteness, my data generator code and training code run something like this (pseudocode):
# This generator opens a file, then grabs and yields one mini-batch at a time.
def data_gen(PATH, batch_size=32):
    with h5py.File(PATH, 'r') as f:
        # `mini_batches` (pseudocode) is the sequence of index ranges, one per mini-batch.
        for mini_batch in mini_batches:
            X = f['X'][mini_batch]
            Y = f['Y'][mini_batch]
            yield (X, Y)
for epoch in range(epochs):
    for data in data_gen(PATH):
        mini_X, mini_Y = data
        mini_X = autograd.Variable(torch.Tensor(mini_X))
        mini_Y = autograd.Variable(torch.Tensor(mini_Y))
        optimizer.zero_grad()  # clear gradients left over from the previous step
        out = net(mini_X)
        loss = F.binary_cross_entropy(out, mini_Y)
        loss.backward()
        optimizer.step()
Something like that. As you can see, I use data_gen as an actual generator for the for-loop, so it runs sequentially with the training. I would like to run it in parallel and have it generate a queue of mini-batches which I can then feed to my network. Currently, it takes more than 5 hours to run one epoch; I think with a parallelized version of this, I could get that down to 3 hours or less. I looked into multiprocessing in Python, but the explanation in the official documentation was a bit dense for me since I have only limited prior experience in parallel computing. If there are some resources I could take a look at, pointing me towards those would be very helpful too! Thanks.
You will need to use threads for the data generation. The idea is to let the CPU handle the data generation (usually loading) while your GPU does the training. That being said, it is usually not the CPU that slows things down; it is the constant reading and writing of files. If you are using a dataset, make sure the files are copied or extracted into contiguous blocks on your file system. If your files are fragmented across your hard drive, loading them will be a bottleneck regardless of the multi-threading mechanism you are using. With SSDs this is not noticeable.
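A minimal sketch of the threaded approach, using only the standard library: a background thread pulls items from the generator into a bounded queue while the main thread consumes them. The helper name prefetch and the max_prefetch parameter are made up for this example.

```python
import queue
import threading


def prefetch(generator, max_prefetch=4):
    """Run `generator` in a background thread and yield its items in order."""
    buffer = queue.Queue(maxsize=max_prefetch)
    done = object()   # sentinel marking the end of the stream
    failure = []      # holds an exception raised inside the producer, if any

    def producer():
        try:
            for item in generator:
                buffer.put(item)    # blocks while the buffer is full
        except Exception as exc:    # hand the error over to the consumer
            failure.append(exc)
        finally:
            buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            if failure:
                raise failure[0]
            return
        yield item
```

With the code from the question, the inner loop would become for data in prefetch(data_gen(PATH)):, so that the next mini-batch is read from disk while the current one is being trained on. This helps when loading is I/O-bound; preprocessing that is heavy in pure Python will still compete with the training loop for the interpreter lock.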
I want to use Keras for a real-time training and prediction setting. In my scenario I get real-time data via MQTT that should be used to train an (LSTM) neural network and/or be fed to the network to get a prediction.
I am using the Tensorflow backend with GPU support and fairly potent GPU capacity but in my scenario Keras does not really profit from GPU acceleration. (I did some performance tests by using the examples in the keras repository to make sure that the GPU acceleration works in general). In my first approach, I used the model.train_on_batch(...) method to train the network with each item coming via MQTT:
model = load_model()
def on_message(msg):
    """
    Method called by MQTT client each time new data comes in
    """
    if msg.topic == 'my/topic':
        X, Y = prepare_data(msg.payload)
        prediction = model.predict(X)
        loss = model.train_on_batch(X, Y)
        send_to_visualization_tool(prediction, loss)
One training step in this setting takes about 200ms. However, when I introduce a buffer e.g. buffering 100 data points, the training time for the whole batch only increases slightly. This suggests that the setup time for batch training has a huge overhead. I also noticed that when using size 1 batches, the CPU consumption is quite high, while the GPU is hardly used at all.
As an alternative, I have now introduced a synchronized Queue: the MQTT client pushes data onto it whenever data comes in, and the neural network then consumes, as one batch, all the data that arrived while it was processing the previous batch:
import queue  # Python 3 name of the `Queue` module
train_data_queue = queue.Queue()
# MQTT client running in separate thread
def on_message(msg):
    train_data_queue.put(msg.payload)
model = load_model()
while True:
    # Dequeue all items from the queue, or block until at least one item is present.
    train_data_batch = dequeue_all(train_data_queue)
    X, Y = prepare_data(train_data_batch)
    predictions = model.predict_on_batch(X)
    losses = model.train_on_batch(X, Y)
    send_to_visualization_tool(predictions, losses)
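The dequeue_all helper is not shown above. One possible sketch, matching the comment in the loop (block until at least one item is present, then take everything else that is already queued), using only the standard library:

```python
import queue


def dequeue_all(q):
    """Block until one item is available, then drain whatever else is queued."""
    items = [q.get()]               # blocks until at least one item is present
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            return items
```

The size of each batch then adapts to the load: the longer the previous training step took, the more items are waiting for the next one.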
This approach works okay, but it would be nice if I could get rid of the additional complexity of synchronized Queues and multi-threading, i.e. get the first approach to work.
My question therefore is: Is there a way to reduce the overhead of training on single-item batches? E.g. by reimplementing the model in pure TensorFlow?
Or can you think of a better way to do real-time training with Keras?
The performance of keras should be broadly similar to the performance of raw tensorflow, so I do not recommend rewriting your model.
Indeed modern hardware usually takes about the same time to train with a single example as it does with a batch of examples, which is why we spend so much effort batching things up. You can get rid of the complexity of synchronized queues if you want to use tf.contrib.batching.batch_function but you'll still need to feed it from many threads if you want to get the extra throughput.