TFLite Interpreter: defining optimal number of threads - python

I am running a quantized TFLite model on a Linux PC for inference with the XNNPack backend. I am aware that TFLite models may suffer from high prediction latency, and I'm trying to optimize performance by setting the number of threads via TFLite.Interpreter(num_threads=X).
I ran trials with X=[4, 6, 8, None] and the best result came from X=4, but this doesn't make sense to me. How is the optimal number of threads determined? Also, does setting num_threads automatically make use of multiple CPU cores, or do I have to use another library/package?
(Other optimizations that could speed up inference are very welcome!) The model I'm using is a quantized Google BERT.
Thanks.

It depends on your target environment. If the target is a single- or dual-core machine and you're not allowed to use multiple cores for your application, you should use num_threads=1.
Otherwise, you can use more threads to leverage multiple cores.
If your target only has 4 cores, using more than 4 threads gives no performance improvement, only memory and context-switching overhead. (The optimal value also depends on the input shapes and the op kernel implementations.)
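If you want to find the sweet spot empirically on your machine, a minimal timing sketch like the one below can help. The model path, the single dummy input, and the run count are placeholders/assumptions (a BERT model typically has several inputs):

import time
import numpy as np
import tensorflow as tf

def benchmark(model_path, num_threads, runs=50):
    interpreter = tf.lite.Interpreter(model_path=model_path, num_threads=num_threads)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    dummy = np.zeros(inp["shape"], dtype=inp["dtype"])
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()  # warm-up run so XNNPack can prepare its kernels
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.set_tensor(inp["index"], dummy)
        interpreter.invoke()
    return (time.perf_counter() - start) / runs

for n in (1, 2, 4, 8, None):
    print(n, benchmark("bert_quant.tflite", n))  # "bert_quant.tflite" is a placeholder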
Regarding performance improvements: integer operations are usually faster than floating-point ones, so you can optimize your model to use integer operations.
https://www.tensorflow.org/lite/performance/model_optimization
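As a rough sketch of the full-integer post-training quantization described in that guide (the SavedModel path, input shape, and representative dataset below are placeholders for your own model and data):

import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # Yield a few samples shaped like the model's real inputs.
    for _ in range(100):
        yield [np.random.rand(1, 128).astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)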
Also, if your target has a GPU, you could try the GPU delegate.
https://www.tensorflow.org/lite/performance/gpu
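From Python, loading a delegate follows roughly the pattern below; the shared-library name/path is an assumption and depends on how the GPU delegate was built for your system:

import tensorflow as tf

# The .so name below is an assumption; use the delegate library built for your system.
gpu_delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")
interpreter = tf.lite.Interpreter(
    model_path="bert_quant.tflite",          # placeholder model path
    experimental_delegates=[gpu_delegate],
)
interpreter.allocate_tensors()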

Related

Training multiple neural networks asynchronously in parallel

The problem
I am currently working on a project that I sadly can't share with you. The project is about hyper-parameter optimization for neural networks, and it requires that I train multiple neural network models (more than I can store on my GPU) in parallel. The network architectures stay the same, but the network parameters and hyper-parameters are subject to change between each training interval. I am currently achieving this using PyTorch in a Linux environment so that my NVIDIA GTX 1660 (6GB RAM) can use the multiprocessing feature that PyTorch provides.
Code (simplified):
def training_function(checkpoint):
    load(checkpoint)
    train(checkpoint)
    unload(checkpoint)
    return checkpoint  # imap_unordered below collects the returned checkpoints

for step in range(training_steps):
    trained_checkpoints = list()
    for trained_checkpoint in pool.imap_unordered(training_function, checkpoints):
        trained_checkpoints.append(trained_checkpoint)
    for optimized_checkpoint in optimize(trained_checkpoints):
        checkpoints.update(optimized_checkpoint)
I currently test with a population of 30 neural networks (i.e. 30 checkpoints) on the MNIST and FashionMNIST datasets, each of which consists of 70,000 (50k training, 10k validation, 10k testing) 28x28 single-channel images. The network I train is a simple LeNet-5 implementation.
I use a torch.multiprocessing pool and allow 7 processes to be spawned. Each process uses some of the available GPU memory just to initialize its CUDA environment. After training, the checkpoints are adapted with my hyper-parameter optimization technique.
The load function in the training_function loads the model and optimizer states (which hold the network parameter tensors) from a local file into GPU memory using torch.load. The unload function saves the newly trained states back to file using torch.save and deletes them from memory. I do this because PyTorch only frees GPU tensors when no variable references them, and I have limited GPU memory.
The current setup works, but each CUDA initialization occupies over 700MB of GPU RAM, so I am interested in other ways of doing this that might use less memory without a penalty to efficiency.
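Roughly, the load/unload helpers look like the sketch below (the Checkpoint attributes, the LeNet-5 class, and the optimizer settings here are illustrative placeholders rather than my exact code):

import torch

def load(checkpoint, device="cuda"):
    # Rebuild the network and optimizer, then pull their states from the local file.
    checkpoint.model = Lenet5().to(device)  # hypothetical model class
    checkpoint.optimizer = torch.optim.SGD(checkpoint.model.parameters(), lr=0.01)
    state = torch.load(checkpoint.path, map_location=device)
    checkpoint.model.load_state_dict(state["model"])
    checkpoint.optimizer.load_state_dict(state["optimizer"])

def unload(checkpoint):
    torch.save(
        {"model": checkpoint.model.state_dict(),
         "optimizer": checkpoint.optimizer.state_dict()},
        checkpoint.path,
    )
    # Drop every reference to the GPU tensors so PyTorch can free them.
    checkpoint.model = None
    checkpoint.optimizer = None
    torch.cuda.empty_cache()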
My attempts
I suspected I could use a thread pool to save some memory, and it did. By spawning 7 threads instead of 7 processes, CUDA was only initialized once, which saved almost half of my memory. However, this led to a new problem: GPU utilization dropped to approx. 30% according to nvidia-smi, which I monitor in a separate Linux terminal. Without threads, I get around 85-90% utilization.
I also messed around with torch.multiprocessing.set_sharing_strategy which is currently set to 'file_descriptor', but with no luck.
My questions
Is there a better way to work with multiple model- and optimizer states without saving and loading them to files while training? I have tried to move the model to CPU using model.cpu() before saving the state_dict, but this did not work in my implementation (memory leaks).
Is there an efficient way I can train multiple neural networks at the same time that uses less GPU memory? When searching the web, I only find references to nn.DataParallel which trains the same model over multiple GPUs by copying it to each GPU. This does not work for my problem.
I will soon have access to multiple, more powerful GPUs with more memory, and I suspect this problem will be less annoying then, but I wouldn't be surprised if there is a better solution I am not getting.
Update (09.03.2020)
For any future readers: if you set out to do something similar to the pseudo code displayed above and you plan on using multiple GPUs, please make sure to create one multiprocessing pool for each GPU device. Pools do not assign tasks to their underlying processes in a fixed order, so you can end up initializing CUDA multiple times in the same process, wasting memory.
Another important note: even if you pass the device (e.g. 'cuda:1') to every torch.cuda function, you may discover that torch does something with the default CUDA device 'cuda:0' somewhere in the code, initializing CUDA on that device in every process and wasting memory on an unwanted, unneeded CUDA initialization. I fixed this by wrapping the entire training_function in with torch.cuda.device(device_id).
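In sketch form (how device_id reaches the function depends on your pool setup; the argument handling here is illustrative):

import torch

def training_function(checkpoint, device_id):
    # Pin all CUDA work in this call to one device so no process
    # accidentally initializes CUDA on 'cuda:0'.
    with torch.cuda.device(device_id):
        load(checkpoint)
        train(checkpoint)
        unload(checkpoint)
        return checkpoint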
I ended up not using multiprocessing pools and instead defined my own custom process class that holds the device and training function. This means I have to maintain an in-queue for each device process, but they all share the same out-queue, so I can retrieve results the moment they are available. I figured writing a custom process class was simpler than writing a custom pool class. I desperately tried to keep using pools as they are easily maintained, but I had to use multiple imap functions, and so the results were not obtainable one at a time, which led to a less efficient training loop.
I am now successfully training on multiple GPUs, but my questions posted above still remain unanswered.
Update (10.03.2020)
I have implemented another way to store model and optimizer state dicts outside of GPU RAM. I have written a function that replaces every tensor in the dicts with its .to('cpu') equivalent. This costs me some CPU memory, but it is more reliable than storing local files.
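The idea, in sketch form (the exact function in my project may differ slightly):

import torch

def state_dict_to_cpu(state_dict):
    # Recursively replace every tensor with its CPU copy so nothing in the
    # dict keeps GPU memory alive.
    cpu_state = {}
    for key, value in state_dict.items():
        if torch.is_tensor(value):
            cpu_state[key] = value.to("cpu")
        elif isinstance(value, dict):
            cpu_state[key] = state_dict_to_cpu(value)  # e.g. optimizer sub-dicts
        else:
            cpu_state[key] = value
    return cpu_state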
Update (11.06.2020)
I have still not found a different approach that leads to fewer CUDA initializations while maintaining the same processing speed. From what I've read and come to understand, PyTorch does not interfere much with how CUDA operates and leaves that up to NVIDIA.
I have ended up using a pool of custom, device-specific processes, called Workers, that is maintained by my custom pool class (more about this above). In addition, I let each of these Workers take in one or more checkpoints as well as the function that processes them (training, testing, hp optimization) via a Queue. These checkpoints are then processed simultaneously via a Python multiprocessing ThreadPool in each Worker, and the results are returned one by one via the return Queue the moment they are ready.
This gives me the parallel procedure I needed, but the memory issue is still there. Due to time constraints, I have come to terms with it for now.
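A condensed sketch of the Worker design (the internal ThreadPool is omitted, and the names are illustrative rather than my actual code):

import torch
import torch.multiprocessing as mp

class Worker(mp.Process):
    def __init__(self, device_id, in_queue, out_queue):
        super().__init__()
        self.device_id = device_id
        self.in_queue = in_queue      # per-device task queue
        self.out_queue = out_queue    # result queue shared by all workers

    def run(self):
        with torch.cuda.device(self.device_id):
            while True:
                task = self.in_queue.get()
                if task is None:               # sentinel: shut this worker down
                    break
                function, checkpoint = task    # e.g. train/test/optimize + data
                self.out_queue.put(function(checkpoint))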

Tensorflow CentralStorageStrategy

The tf.distribute.experimental.CentralStorageStrategy specifies that variables are not mirrored; instead, they are placed on the CPU and ops are replicated across all GPUs.
If I have a really big model that does not fit on any single GPU, could this be a solution since variables are stored on CPU? I know that there will be networking overhead and that's fine.
This Official TF Tutorial on Youtube states that this could be used to handle "large embeddings" that would not fit on one GPU. Could this also be the case for large variables and activations?
In the official documentation, it states that "if there is only one GPU, all variables and operations will be placed on that GPU." If I only used 1 GPU, it seems that CentralStorageStrategy would be automatically disabled even though storing large variables (embeddings for example) on the CPU instead of GPU could be very valuable since there might not exist a GPU that has enough memory to fit it on device. Is this a design oversight or intended behavior?
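For context, basic usage looks roughly like this (the model inside the scope is a placeholder; the point is only that variables created under the scope are placed according to the strategy):

import tensorflow as tf

strategy = tf.distribute.experimental.CentralStorageStrategy()
with strategy.scope():
    # Placeholder model with a large embedding table; variables created here
    # follow the strategy's placement rules.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=1_000_000, output_dim=256),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")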

Neural network inference with one image: Why is GPU utilization not 100%?

First of all: this question is connected to neural network inference and not training.
I have discovered that when running inference with a trained neural network on a single image over and over on a GPU (e.g. a P100), TensorFlow does not reach 100% compute utilization, but only around 70%. This is also the case when the image does not have to be transferred to the GPU, so the issue has to be connected to constraints on the parallelization of the calculations. My best guesses for the reasons are:
TensorFlow can only exploit the parallelization capabilities of a GPU up to a certain level. (The higher utilization of the same model as a TensorRT model also suggests this.) In this case, the question is: what is the reason for that?
The inherent neural network structure, with several subsequent layers, prevents higher usage. In that case the problem is not framework overhead but lies in the general design of neural networks. The question is then: what are the restrictions that cause this?
Both of the above combined.
Thanks for your ideas on the issue!
Why do you expect GPU utilization to reach 100% when you run the neural network prediction for a single image?
GPU utilization is reported per time unit (e.g. 1 second). If the neural network finishes before that time unit has elapsed (e.g. within 0.5s), then for the rest of the interval the GPU may be used by other programs or not used at all. If no other program uses the GPU either, you will simply not reach 100%.
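To see this concretely, you can compare the per-image latency with the sampling window you observe utilization over; a rough sketch (model and image are placeholders for your own network and input):

import time

latencies = []
for _ in range(100):
    start = time.perf_counter()
    model.predict(image)                 # single-image inference, placeholder call
    latencies.append(time.perf_counter() - start)

print("mean latency per image: %.4f s" % (sum(latencies) / len(latencies)))
# Any time the GPU sits idle between calls (input prep, Python overhead,
# kernel launch gaps) lowers the utilization averaged over the sampling interval.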

Tensorflow, OpenAI Gym, Keras-rl performance issue on basic reinforcement learning example

I'm doing reinforcement learning, and I'm having trouble with performance.
Situation, no custom code:
I loaded a Google Deep Learning VM (https://console.cloud.google.com/marketplace/details/click-to-deploy-images/deeplearning) on Google Cloud. This comes with all the prerequisites installed (CUDA, cuDNN, drivers) with a NVidia K80 videocard.
Installed keras-rl, OpenAI gym
Now when I run the (standard) example dqn_cartpole.py with visualize=False on line 46, it uses about 20% of my GPU, resulting in around 100 steps per second, which is around 3x SLOWER than using the CPU on my Razer Blade 15 (i7-8750H).
I have checked all the bottlenecks I can think of, CPU usage, memory and HD I/O is all normal.
Please help!
Thanks in advance
This is not necessarily a problem. Using a GPU does not come "for free" in terms of performance, and it's not always faster than a CPU. Because not everything runs on GPU (e.g. the gym environment itself still runs on CPU), you do incur "communication costs" (e.g. moving memory to and from the GPU). This will only be worth it if you can really make good use of your GPU.
Now, GPUs are also not necessarily faster than CPUs. GPUs are very good at performing lots of similar computations in parallel. This is necessary, for example, for matrix multiplications between large matrices, which indeed happens fairly often when training large, deep Neural Networks. If you only need a relatively small number of computations that can be done in parallel like that, and mostly just have sequential code, a GPU definitely can be slower than a CPU (and the CPU you mentioned is a rather powerful one).
Now, if I look at the part of the linked code where the Neural Network is built (starting from line 22), it looks like a rather small network: just a few layers of 16 nodes each. This is not a huge Neural Network with convolutional layers followed by large (e.g. hundreds of nodes) fully connected layers (which would likely be overkill for a small problem like Cartpole anyway). So it's no surprise that you only get to use 20% of the GPU (it simply can't do more in parallel because the matrices being multiplied are too small), and it's not necessarily a surprise that it is slower than simply running on a CPU either.
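As a quick sanity check, you can hide the GPU from TensorFlow and rerun the example to compare steps per second directly (setting the environment variable before TensorFlow is imported works for both TF 1.x and 2.x):

import os

# Hide all CUDA devices so TensorFlow falls back to the CPU, then run the
# same dqn_cartpole.py training loop and compare steps/second.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf  # must be imported after setting the variable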

Training changing input size RNN on Tensorflow

I want to train an RNN with different input sizes for sentence X, without padding. The logic I use is that I keep global variables and, for every step, I take an example, write the forward propagation (i.e. build the graph), run the optimizer, and then repeat with another example. The program is extremely slow compared to a numpy implementation of the same thing, in which I implemented forward and backward propagation using the same logic as above. The numpy implementation takes a few seconds, while Tensorflow is extremely slow. Would running the same thing on a GPU help, or am I making some logical mistake?
As a general guideline, a GPU boosts performance only if you have calculation-intensive code and little data transfer. In other words, if you train your model one instance at a time (or with small batch sizes), the overhead of data transfer to/from the GPU can even make your code run slower! But if you feed in a good chunk of samples, then the GPU will definitely boost your code.
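One common way to keep variable-length sentences while still giving the GPU decent batches is to pad only within buckets of similar length; a sketch with tf.data (assuming TF 2.x and a list of integer-encoded sentences called sentences, which is a placeholder):

import tensorflow as tf

dataset = tf.data.Dataset.from_generator(
    lambda: iter(sentences),
    output_types=tf.int32,
    output_shapes=tf.TensorShape([None]),
)

# Group sentences into length buckets and pad only up to the longest sentence
# in each batch, so batches stay dense enough to keep the GPU busy.
dataset = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        element_length_func=lambda s: tf.shape(s)[0],
        bucket_boundaries=[10, 20, 40],
        bucket_batch_sizes=[64, 32, 16, 8],   # one more entry than boundaries
        padded_shapes=[None],
    ))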
