Distributed training of a GRU network on Spark - PyTorch implementation - python

I have an implementation of a GRU-based network in PyTorch, which I train on the 4 GB GPU in my laptop, and it obviously takes a lot of time (4+ hours for 1 epoch). I am looking for ideas/leads on how I can move this deep-learning model to train on a couple of Spark clusters instead.
So far, I have only come across this GitHub library called SparkTorch, which unfortunately has limited documentation and the examples provided are way too trivial.
https://github.com/dmmiller612/sparktorch
To summarize, I am looking for answers to the following two questions:
Is it a good idea to train a deep learning model on Spark clusters, given that I have read in places that the communication overhead undermines the gains in training speed?
How do I convert the PyTorch model (and the underlying dataset) in order to perform distributed training across the worker nodes?
Any leads appreciated.
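For reference (not a Spark-specific answer), the snippet below is a minimal sketch of what plain data-parallel training looks like with torch.distributed and DistributedDataParallel; wrappers such as SparkTorch coordinate something broadly similar on top of Spark workers, so it mainly illustrates the model and dataset changes involved. The dataset, model factory and hyperparameters are placeholders.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(rank, world_size, dataset, build_model):
    # One process per worker; the backend and init method depend on the cluster setup.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    model = DDP(build_model())                      # gradients are all-reduced across workers
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)                    # reshuffle the per-worker data shards
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                         # DDP synchronizes gradients here
            optimizer.step()

    dist.destroy_process_group()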

Related

Can a Tensorflow saved model created on a large GPU be used to make predictions on a small CPU?

Let’s say one saves a Tensorflow model created using large data and GPUs. If one wanted to then use the saved model to do a single prediction with one small piece of data, does one still need the huge computers that created the model?
I’m wondering more generally how the size and computing resources needed to generate a deep learning model relate to using the model to make predictions.
This is relevant because, if one is using Google Cloud Compute, it costs more money to use the huge computers all the time. If one could use the huge computers only to train the model and then more modest ones to run the app that makes predictions, it would save a lot of money.
Resources needed for prediction depend on the model size, not on the device it was trained on.
If the model has 200 billion parameters, you will not be able to run it on a workstation, because you do not have enough memory.
But you can use a model with 10 million parameters without problems, even if it was trained on a GPU or TPU.
Every parameter takes 4 to 8 bytes. If you have 8 GB of memory, you will probably be able to run a model with a few hundred million parameters.
Prediction is fast (assuming you have enough memory). Heavy resources are only needed to train a model quickly; it is efficient to train on a GPU/TPU even if your model is small.
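As a back-of-the-envelope illustration of the sizing rule above (the helper below is hypothetical, not part of any framework):

def inference_memory_gb(num_params, bytes_per_param=4):
    # Rough lower bound on the RAM needed just to hold the weights.
    return num_params * bytes_per_param / 1024**3

print(inference_memory_gb(10_000_000))            # ~0.04 GB: trivial on a laptop CPU
print(inference_memory_gb(200_000_000_000, 8))    # ~1490 GB: far beyond any workstation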

How to split one Keras model among multiple GPUs

I work on medical imaging, so I need to train massive 3D CNNs that are difficult to fit into one GPU. I wonder if there is a way to split a massive Keras or TensorFlow graph among multiple GPUs such that each GPU only runs a small part of the graph during training and inference. Is this type of distributed training possible with either Keras or TensorFlow?
I have tried using with tf.device('\gpu:#') when building the graph, but I am experiencing memory overflow. The logs seem to indicate that the entire graph is still being run on gpu:0.
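For reference, manual model parallelism in Keras/TensorFlow usually looks like the sketch below: consecutive chunks of layers are built under different device scopes (note that TensorFlow device strings use a forward slash, e.g. '/gpu:0'). Whether the runtime fully honours the placement depends on the TensorFlow version, so treat this as a starting point; the layer sizes and input shape are placeholders.

import tensorflow as tf

inputs = tf.keras.Input(shape=(64, 64, 64, 1))      # a small 3D volume as an example

with tf.device('/gpu:0'):                           # first half of the network on GPU 0
    x = tf.keras.layers.Conv3D(32, 3, activation='relu')(inputs)
    x = tf.keras.layers.MaxPooling3D()(x)

with tf.device('/gpu:1'):                           # second half on GPU 1
    x = tf.keras.layers.Conv3D(64, 3, activation='relu')(x)
    x = tf.keras.layers.GlobalAveragePooling3D()(x)
    outputs = tf.keras.layers.Dense(2, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')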

Transfer learning with a bad computer

I want to use transfer learning to classify images. This is my first try at transfer learning. I currently use the VGG16 model. Since my data are very different from the images used to train the original model, theory tells me I should retrain many layers, potentially including hidden layers.
My computer has 8 GB of RAM and an i5 at 2.40 GHz, with no GPU. My dataset is small (3000 images), but the data are stored as matrices in Python memory, not saved in a folder. Almost all my RAM is taken up by those images.
The original VGG16 model has 130 million parameters. If I only keep the weights of the hidden layers and create 2 new (and small, of size 512 and 256) fully connected layers at the end, I still have 15M parameters to train, for a total of 30M parameters.
I currently use an image size of 224×224, matching the VGG16 input.
My computer needs 1h30 for 1 epoch. After 10 epochs I have bad accuracy (50%, versus 90% with a conv net trained from scratch).
My questions:
My computer crashes after X epochs, and I don't know why. Could it be a RAM problem? Since VGG starts training in the first epoch and the later epochs are just weight adjustments, shouldn't those later epochs leave memory usage unchanged?
Should I unfreeze the input layer so I can use images of reduced dimensions, to ease the memory problem and the training time? Would that hurt the conv net's performance too much?
Is it normal to need 1h30 to compute 1 epoch with 15M trainable parameters? Since I still need to find the optimal number of layers to unfreeze, the shape of the new fully connected layers, the learning rate, the optimizer... it looks impossible to me to optimize a transfer learning model with my current computing resources in a decent amount of time.
Do you have any tips for transfer learning?
thanks
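For concreteness, the frozen-base setup I describe above looks roughly like this in Keras (a sketch only; num_classes, the dense-layer sizes and x_train/y_train stand in for my in-memory data):

import tensorflow as tf

num_classes = 2                                   # placeholder

base = tf.keras.applications.VGG16(weights='imagenet', include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                            # freeze all convolutional weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss='categorical_crossentropy', metrics=['accuracy'])

# x_train / y_train are the in-memory arrays mentioned above; a small batch size
# keeps the peak memory per step low on an 8 GB machine.
model.fit(x_train, y_train, batch_size=16, epochs=10)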
No specific tips for transfer learning, but if you are lacking computing power, it might be helpful to consider moving to cloud resources. AWS, Google Cloud, Azure and other services are available at really reasonable prices.
Most of them also provide some free resources, which can be enough for small ML projects or student tasks.
Notably:
Google Colab provides a free GPU for a limited time
AWS provides ~250 hours of training per month on SageMaker
Azure Notebooks also provides some free (but limited) computing power
Most of these services also provide free general compute power, on which you can also run ML tasks, but that might require some additional manual tweaking.

Is it possible to remove categories in a pretrained tensorflow model?

I am currently using Tensorflow Object Detection API for my human detection app.
I tried filtering in the API itself, which worked, but I am still not satisfied with it because it's slow. So I'm wondering if I could remove the other categories in the model itself to make it faster as well.
If that is not possible, can you please give me other suggestions to make the API faster, since I will be using two cameras? Thanks in advance, and also pardon my English :)
Your question addresses several topics around using pretrained neural network models.
Theoretical methods
In general, you can always neutralize categories by removing the corresponding neurons in the softmax layer and computing a new softmax only over the relevant rows of the weight matrix.
This method will surely work (maybe this is what you meant by filtering), but it will not accelerate the network's computation time by much, since most of the FLOPs (multiplications and additions) remain; a small sketch of the idea appears after this list of methods.
Similar to decision trees, pruning is possible but may reduce performance. I explain below what pruning means; note that the accuracy on your categories may be preserved, since you are not just trimming, you are also predicting fewer categories.
Transfer the learning to your problem. See Stanford's course in computer vision here. What I have most often seen work well is keeping the convolution layers as-is and preparing a medium-size dataset of the objects you'd like to detect.
I will add more theoretical methods if you request, but the above are the most common and accurate I know.
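As an illustration of the first method, applied to a plain Keras classifier rather than to the Object Detection API graph itself (model and the kept class indices are placeholders):

import tensorflow as tf

keep = [0, 15]                                   # e.g. background + person; placeholder indices
old_dense = model.layers[-1]                     # final Dense + softmax layer of a loaded classifier
W, b = old_dense.get_weights()                   # W: (features, num_classes), b: (num_classes,)

new_dense = tf.keras.layers.Dense(len(keep), activation='softmax')
new_out = new_dense(model.layers[-2].output)     # reattach to the penultimate layer
new_model = tf.keras.Model(model.input, new_out)
new_dense.set_weights([W[:, keep], b[keep]])     # copy only the columns/entries for kept classes

As noted above, this restricts the predictions to the classes of interest but removes almost none of the computation.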
Practical methods
Make sure you are serving your TensorFlow model, and not just running inference from Python code. This can significantly accelerate performance.
You can export the parameters of the network and load them in a faster framework such as CNTK or Caffe. These frameworks work in C++/C# and can run inference much faster. Make sure you load the weights correctly; some frameworks use a different order of tensor dimensions when saving/loading (little/big-endian-like issues).
If your application performs inference on several images, you can distribute the computation across several GPUs. This can also be done in TensorFlow; see Using GPUs.
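A sketch of the first practical point, assuming a TF 2.x SavedModel export and a TensorFlow Serving container listening on the default REST port; the model name, export path and batch array are placeholders:

import tensorflow as tf
import requests

tf.saved_model.save(model, '/models/human_detector/1')    # 'model' is a placeholder

# With TensorFlow Serving pointed at /models/human_detector, predictions can be
# requested over REST instead of running Python inference code in-process:
resp = requests.post(
    'http://localhost:8501/v1/models/human_detector:predict',
    json={'instances': batch.tolist()},                    # 'batch' is a placeholder array
)
predictions = resp.json()['predictions']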
Pruning a neural network
Maybe this is the most interesting method of adapting big networks for simple tasks. You can see a beginner's guide here.
Pruning means that you remove parameters from your network, specifically whole nodes/neurons in a decision tree/neural network (respectively). To do that in object detection, you can proceed as follows (in the simplest way):
Randomly prune neurons from the fully connected layers.
Train one more epoch (or more) with low learning rate, only on objects you'd like to detect.
(optional) Perform the above several times for validation and choose best network.
The above procedure is the most basic one, but you can find plenty of papers that suggest algorithms for doing this, for example Automated Pruning for Deep Neural Network Compression and An iterative pruning algorithm for feedforward neural networks.
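A minimal illustration of step 1 above, approximated by zeroing out randomly chosen neurons in the fully connected layers of a Keras model rather than physically removing them (model and the pruning fraction are placeholders):

import numpy as np
import tensorflow as tf

prune_fraction = 0.3
for layer in model.layers:
    if isinstance(layer, tf.keras.layers.Dense):
        W, b = layer.get_weights()                          # W: (inputs, units), b: (units,)
        drop = np.random.rand(W.shape[1]) < prune_fraction  # randomly chosen neurons to prune
        W[:, drop] = 0.0                                    # zero their incoming weights...
        b[drop] = 0.0                                       # ...and their biases
        layer.set_weights([W, b])

# Follow with steps 2-3: fine-tune for an epoch or two at a low learning rate
# on the classes you actually want to detect, then validate.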

Cannot train a CNN (or other neural network) to fit (or even over-fit) on Keras + Theano

For learning purposes, I am trying to implement a CNN from scratch, but the results do not seem to improve over random guessing. I know this is not the best approach on home hardware, and following course.fast.ai I have obtained much better results via transfer learning, but for a deeper understanding I would like to see, at least in theory, how one could do it otherwise.
Testing on CIFAR-10 posed no issues - a small CNN trained from scratch in a matter of minutes with an error of less than 0.5%.
However, when trying to test against the Cats vs. Dogs Kaggle dataset, the results did not budge from 50% accuracy. The architecture is basically a copy of AlexNet, including the non-state-of-the-art choices (large filters, histogram equalization, Nesterov-SGD optimizer). For more details, I put the code in a notebook on GitHub:
https://github.com/mspinaci/deep-learning-examples/blob/master/dogs_vs_cats_with_AlexNet.ipynb
(I also tried different architectures, more VGG-like and using Adam optimizer, but the result was the same; the reason why I followed the structure above was to match as closely as possible the Caffe procedure described here:
https://github.com/adilmoujahid/deeplearning-cats-dogs-tutorial
and that seems to converge quickly enough, according to the author's description here: http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/).
I was expecting some fitting to happen quickly, possibly flattening out due to the many suboptimal choices made (e.g. small dataset, no data augmentation). Instead, I saw no improvement at all, as the notebook shows.
So I thought that maybe I was simply overestimating my GPU and patience, and that the model was too complicated even to overfit my data in a few hours (I ran 70 epochs, each time roughly 360 batches of 64 images). Therefore I tried to overfit as hard as I could, running these other models:
https://github.com/mspinaci/deep-learning-examples/blob/master/Ridiculously%20overfitting%20models...%20or%20maybe%20not.ipynb
The purely linear model started showing some overfitting - around 53.5% training accuracy vs 52% validation accuracy (which I guess is thus my best result). That matched my expectations. However, to try to overfit as hard as I could, the second model is a simple 2-layer feedforward neural network, without any regularization, that I trained on just 2000 images with a batch size of up to 500. I was expecting the NN to overfit wildly, quickly getting to 100% training accuracy (after all, it has 77M parameters for 2k pictures!). Instead, nothing happened, and the accuracy flattened out at 50% quickly enough.
Any tip about why none of the "multi-layer" models seems able to pick up any feature (be it "true" or due to overfitting) would be very much appreciated!
Note on versions etc.: the notebooks were run on Python 2.7, Keras 2.0.8, Theano 0.9.0. The OS is Windows 10, and the GPU is a not-so-powerful GeForce GTX 960M, which should nonetheless be sufficient for basic tasks.
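For context, the overfitting sanity check I describe above (an unregularized dense network on a tiny subset, which should reach near-100% training accuracy) looks roughly like this in Keras; this is not the exact notebook code, and x_small / y_small stand in for the ~2000 images and their binary labels:

from keras.models import Sequential
from keras.layers import Dense, Flatten

model = Sequential([
    Flatten(input_shape=(224, 224, 3)),
    Dense(512, activation='relu'),        # no regularization or dropout on purpose
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# With roughly 224*224*3*512 ≈ 77M parameters for 2000 images, training accuracy
# should climb towards 100% within a few epochs if gradients are flowing;
# if it stays pinned near 50%, the problem is in preprocessing, labels or scaling.
model.fit(x_small, y_small, batch_size=500, epochs=30)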
