I use torch to build a model. In the training loop, I need to compute a dictionary of values which will happen on GPU. Then, every few iterations, I need to use the dictionary which was on GPU to perform some other tasks on CPU. So there is a back and forth between the two. Currently, I use torch.save and torch.load to save and load the said dictionary which is very slow and not thread-safe.
I am almost sure that there is a better way to accomplish what I am trying to do. I understand that copying data to/from GPU causes slowdown but I am looking for a strategy that does not involve the disk.
Related
I am facing a problem of improving the training speed / efficiency of a Tensorflow implementation of point cloud object detection algorithm.
The input data is a [8000, 100, 9] float32 tensor, with a size roughly 27MB per sample. On a batch size of 5, data loading becomes a bottleneck in training as most of the time GPU utlization rate is 0% until data arrives.
I have tried the following methods to increase data loading speed.
Use num_parallel_calls in tf.Dataset .map API, and use multiple threads for reading this big tensor. The problem is .map wraps a py_fun which is subject to Global Interpreter Lock and thus multi-threading does not improve I/O efficiency.
Use tf.Dataset .interleave API. Since it's also multi-threading based, it has the same problem as 2.
Use TFRecord format. This is even slower than method 1 and 2. Possibility is TFRecord will convert tensor to numpy, then serialize numpy to bytes, then wrap this bytes to tensorflow structure and write to disk. Numpy to Tensor takes a long time for my data as measured by tf.convert_to_tensor().
Any suggestions how to move forward would be helpful. Thanks!
Follow up on comments
Am I using slow disks? Data is stored on a mounted disk. Could be a reason.
Can the data be fit into GPU memory? Unfortunately no. There are ~70,000 samples. I tried cache a small dataset into RAM and GPU utlization rate is 30%~40%, which is probably the highest expectation for this particular network.
Some ideas:
You should use a combination of 1,2 and 3. If you save your files as TFRecords, you can read them in parallel, that's what they are designed for. Then, you will be able to use num_parallel_calls and interleave, because that way you don't have to wrap a py_func.
.map doesn't have to wrap a .py_func, you could for example use tf.keras.utils.get_file. That way you also avoid using py_func and use num_parallel_calls efficiently. I still recommend using TFRecords, they are designed for this use case.
Another option is to use an SSD to store your data instead of a Hard Disk.
You can also look into the .cache function of the tf.Dataset API. Maybe you can try loading a random subset of the data, training multiple eopchs on that, and then in the mean time fetch another subset of the data (using tf.prefetch), and then train multiple epochs on that, and so on. This idea is more of a long shot as it might affect performance, but it just might work in your case.
The problem
I am currently working on a project that I sadly can't share with you. The project is about hyper-parameter optimization for neural networks, and it requires that I train multiple neural network models (more than I can store on my GPU) in parallel. The network architectures stay the same, but the network parameters and hyper-parameters are subjected to change between each training interval. I am currently achieving this using PyTorch on a linux environment in order to allow my NVIDIA GTX 1660 (6GB RAM) to use the multiprocessing feature that PyTorch provides.
Code (simplified):
def training_function(checkpoint):
load(checkpoint)
train(checkpoint)
unload(checkpoint)
for step in range(training_steps):
trained_checkpoints = list()
for trained_checkpoint in pool.imap_unordered(training_function, checkpoints):
trained_checkpoints.append(trained_checkpoint)
for optimized_checkpoint in optimize(trained_checkpoints):
checkpoints.update(optimized_checkpoint)
I currently test with a population of 30 neural networks (i.e. 30 checkpoints) with the MNIST and FashionMNIST datasets which consists of 70 000 (50k training, 10k validation, 10k testing) 28x28 images with 1 channel each respectively. The network I train is a simple Lenet5 implementation.
I use a torch.multiprocessing pool and allow 7 processes to be spawned. Each process uses some of the GPU memory available just to initialize the CUDA environment in each process. After training, the checkpoints are adapted with my hyper-parameter optimization technique.
The load function in the training_function loads the model- and optimizer state (holds the network parameter tensors) from a local file into GPU memory using torch.load. The unload saves the newly trained states back to file using torch.save and deletes them from memory. I do this because PyTorch will only detach GPU tensors when no variable is referencing them. I have to do this because I have limited GPU memory.
The current setup works, but each CUDA initialization occupies over 700MB of GPU RAM, and so I am interested if there are other ways I could do this that may use less memory without a penalty to efficiency.
My attempts
I suspected I could use a thread pool in order to save some memory, and it did. By spawning 7 threads instead of 7 processes, CUDA was only initialized once, which saved almost half of my memory. However, this lead to a new problem in which the GPU only utilized approx. 30% utilization according to nvidia-smi that I am monitoring in a separate linux terminal. Without threads, I get around 85-90% utilization.
I also messed around with torch.multiprocessing.set_sharing_strategy which is currently set to 'file_descriptor', but with no luck.
My questions
Is there a better way to work with multiple model- and optimizer states without saving and loading them to files while training? I have tried to move the model to CPU using model.cpu() before saving the state_dict, but this did not work in my implementation (memory leaks).
Is there an efficient way I can train multiple neural networks at the same time that uses less GPU memory? When searching the web, I only find references to nn.DataParallel which trains the same model over multiple GPUs by copying it to each GPU. This does not work for my problem.
I will soon have access to multiple, more powerful GPUs with more memory, and I suspect this problem will be less annoying then, but I wouldn't be surprised if there is a better solution I am not getting.
Update (09.03.2020)
For any future readers, if you set out to do something similar to the pseudo code displayed above, and you plan on using multiple GPUs, please make sure to create one multiprocessing pool for each GPU device. Pools don't execute functions in order with the underlying processes it contains, and so you will end up initializing CUDA multiple times on the same process, wasting memory.
Another important note is that while you may be passing the device (e.g. 'cuda:1') to every torch.cuda-function, you may discover that torch does something with the default cuda device 'cuda:0' somewhere in the code, initializing CUDA on that device for every process, which wastes memory on an unwanted and non-needed CUDA initialization. I fixed this issue by using with torch.cuda.device(device_id) that encapsulate the entire training_function.
I ended up not using multiprocessing pools and instead defined my own custom process class that holds the device and training function. This means I have to maintain in-queues for each device-process, but they all share the same out-queue, meaning I can retrieve the results the moment they are available. I figured writing a custom process class was simpler than writing a custom pool class. I desperately tried to keep using pools as they are easily maintained, but I had to use multiple imap-functions, and so the results were not obtainable one at a time, which lead to a less efficient training-loop.
I am now successfully training on multiple GPUs, but my questions posted above still remains unanswered.
Update (10.03.2020)
I have implemented another way to store model- and optimizer statedicts outside of GPU RAM. I have written function that replaces every tensor in the dicts with it's .to('cpu') equivalent. This costs me some CPU memory, but it is more reliable than storing local files.
Update (11.06.2020)
I have still not found a different approach that leads to fewer CUDA initializations while maintaining the same processing speed. From what I've read and come to understand, PyTorch does not infer too much with how CUDA is operating, and leaves that up to NVIDIA.
I have ended up using a pool of custom, device-specific processes, called Workers, that is maintained by my custom pool class (more about this above). In addition, I let each of these Workers take in one or more checkpoints as well as the function that processes them (training, testing, hp optimization) via a Queue. These checkpoints are then processed simultaneously via a python multiprocessing ThreadPool in each Worker and the results are then returned one by one via the return Queue the moment they are ready.
This gives me the parallel procedure I was needing, but the memory issue is still there. Due to time constraints, I have come to terms with it for now.
Now, I am running MTCNN(implement on Tensorflow) for face recognition on the GPU.
Since MTCNN using three models, PNet, RNet, ONet, and between them, running some steps by NumPy.
For example, when get the output from PNet, it will do some numpy.transpose on the output to get boxes, then pass these boxes to RNet.
So I suppose only PNet, RNet and ONet models would be run on GPU, other NumPy steps would be on CPU. Then it will copy the output from GPU memory to main memory. It would be quite waste time.
Is my guess right?
For improve the performance, I want to put all MTCNN calculations on GPU.
Anyone would give me some idea or an example?
First of all, the memcopy between GPU and CPU can be not that expensive, I would measure this first.
If you use TensorFlow-GPU, all the operations will be automatically performed on GPU whenever possible. Of course, this will not work with NumPy steps. So, the way to go would be replace the NumPy routines with the TensorFlow ones. TF has quite rich tensor manipulation library, which is similar in API to NumPy. Compare, for instance np.transpose to tf.transpose.
I want to train an RNN with different input size of sentence X, without padding. The logic used for this is that I am using Global Variables and for every step, I take an example, write the forward propagation i.e. build the graph, run the optimizer and then repeat the step again with another example. The program is extremely slow as compared to the numpy implementation of the same thing where I have implemented forward and backward propagation and using the same logic as above. The numpy implementation takes a few seconds while Tensorflow is extremely slow. Can running the same thing on GPU will be useful or I am doing some logical mistake ?
As a general guideline, GPU boosts performance only if you have calculation intensive code and little data transfer. In other words, if you train your model one instance at a time (or on small batch sizes) the overhead for data transfer to/from GPU can even make your code run slower! But if you feed in a good chunk of samples, then GPU will definitely boost your code.
Is there a way to build following graph in Tensorflow:
Load some N images (N can vary for each set) using TF Queues and TF Image Readers.
Process these images to get fixed size image and prepare batches.
Feed these batches through the CNN model
Some questions/info:
I am trying to build data loading part in TF instead of Python functions and feed_dict. I guess, TF Data loading can train the model faster compared to python and feed_dict. Is that right ?
Building the graph for small N (N<5) is easy. Define exclusive nodes for each image in N and process on them. (working)
Can I use TF "while_loop" to build such functionality to read N images ??
Does Keras supports such functionality ?
Thanks for your suggestions.
I just did this last week! It was awesome, I learned a ton about tensorflow using things like tf.map_fn, and tf.cond. And it worked.
This week I just refactored my code to eliminate it all, because it was a bad idea.
Issues I ran into:
Doing preprocessing in tensorflow is messy to debug. Doing proper TDD will definitely benefit you here, but still not going to be particularly pretty or easy to debug.
You should be offloading the preprocessing to the CPU and leaving the GPU (assuming you're using one) to do training. A better approach is to just have a queue and load it from a thread/class that's dedicated to your preprocessing task. And doing the work in numpy/scikit/scikit-image is going to be easier to configure and test.
I thought I was so smart, corralling all my code into a single model. But the complexity of the preprocessing meant my model was really hard to iterate on, it got to be rigid code quickly - example is when I added my test set evaluation in, the preprocessing requirement was slightly different. Suddenly I had to add large sections of conditional code to my model and it got ugly quick.
That being said, my preprocessing steps were maybe more complex than yours. If you're sticking to simple things where you can just apply some of the simple image preprocessing steps it might still be easier for you to go this approach.
To answer your questions specifically:
Queues won't give any benefit over feed_dict that I know of. You still have a problem of moving data from a TF queue on the CPU to the GPU memory each iteration same as feed_dict does, watch this thread if you care about that topic, GPU queues are coming: https://github.com/tensorflow/tensorflow/issues/7679
You should just dequeue_many from the queue, process them as a batch. If you need to do something to each individual image just use tf.map_fn which will remove the first dimension and pass individual 3D images to your specified function. But heed my warning above when you go this route - you'll probably be happier just doing this in a separate thread.
Already answered in #2, use tf.map_fn to iterate over multiple images in a batch. it's pretty easy to use actually.
I don't know Keras.