I work on medical imaging, so I need to train massive 3D CNNs that are difficult to fit on a single GPU. Is there a way to split a large Keras or TensorFlow graph across multiple GPUs so that each GPU runs only a small part of the graph during training and inference? Is this type of distributed training possible with either Keras or TensorFlow?
I have tried using with tf.device('/gpu:#') when building the graph, but I am still running out of memory. The logs seem to indicate that the entire graph is still being run on gpu:0.
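To make the idea of splitting a graph concrete, here is a minimal pure-Python sketch of deciding which layer goes on which GPU. It does not call TensorFlow; the helper name `assign_layers_to_devices`, the layer names, and the cost numbers are all hypothetical, and the balancing rule is just one simple choice.

```python
def assign_layers_to_devices(layer_costs, num_gpus):
    """Split an ordered list of (layer_name, estimated_cost) pairs into
    contiguous groups, one per GPU, keeping group costs roughly balanced."""
    total = sum(cost for _, cost in layer_costs)
    target = total / num_gpus  # share of the total cost each GPU should carry
    placement = {}
    device_index = 0
    running = 0.0
    for name, cost in layer_costs:
        # Move to the next device once the current one has reached its share,
        # but never go past the last device.
        if running >= target and device_index < num_gpus - 1:
            device_index += 1
            running = 0.0
        placement[name] = "/gpu:%d" % device_index
        running += cost
    return placement


# Hypothetical layers with made-up relative memory costs.
layers = [("conv1", 4.0), ("conv2", 4.0), ("conv3", 2.0), ("dense", 2.0)]
placement = assign_layers_to_devices(layers, num_gpus=2)
print(placement)
```

Each layer would then be created inside `with tf.device(placement[name]):`. In TensorFlow 2.x, `tf.debugging.set_log_device_placement(True)` logs where each op actually runs, which helps confirm whether the placement took effect.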
Related
I am already using Google Colab to train my model, so I will not use my own GPU for training. Is there a performance difference between GPU and CPU when working with a pre-trained model? I have already trained a model on a Google Colab GPU and used it on my own local CPU. Should I use a GPU for testing?
It depends on how many predictions you need to make. Training involves a very large number of calculations, so parallelising them on a GPU shortens the overall training time. With a trained model you usually only need to make an occasional prediction, and in that situation a CPU should be fine. However, if you need to make as many predictions as during training, a GPU would be beneficial. This can be particularly true in reinforcement learning, where the model must adapt to continuously changing environmental input.
I have an implementation of a GRU-based network in PyTorch, which I train on the 4 GB GPU in my laptop, and it takes a lot of time (4+ hours for 1 epoch). I am looking for ideas/leads on how I can move this deep-learning model to train on a couple of Spark clusters instead.
So far, I have only come across this GitHub library called SparkTorch, which unfortunately has limited documentation and the examples provided are way too trivial.
https://github.com/dmmiller612/sparktorch
To summarize, I am looking for answers to the following two questions:
Is it a good idea to train a deep learning model on Spark clusters, given that I have read in places that the communication overhead undermines the gains in training speed?
How do I convert the PyTorch model (and the underlying dataset) in order to perform distributed training across the worker nodes?
Any leads appreciated.
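As background for both questions, here is a minimal NumPy sketch of synchronous data-parallel training, the scheme that most distributed training setups follow: each worker computes a gradient on its own shard of the data, and the gradients are averaged before every update. It uses a toy linear model rather than a GRU, runs the "workers" in a plain loop, and is not SparkTorch's API; all names and numbers are illustrative assumptions.

```python
import numpy as np


def shard_gradient(w, X, y):
    """Gradient of mean squared error for a linear model on one data shard."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)


rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

# Split the dataset into equal shards, one per (simulated) worker.
num_workers = 4
X_shards = np.array_split(X, num_workers)
y_shards = np.array_split(y, num_workers)

w = np.zeros(3)
learning_rate = 0.1
for step in range(200):
    # Each worker computes a gradient on its own shard
    # (this would run in parallel on a real cluster).
    grads = [shard_gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    # Communication step: average the gradients across workers.
    avg_grad = np.mean(grads, axis=0)
    w = w - learning_rate * avg_grad

print(w)
```

The averaging line is where the communication overhead comes from: every step, each worker has to send a gradient the size of the model and receive the result. With equal-sized shards the averaged gradient equals the full-batch gradient, so the gain is purely in how fast each shard can be processed.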
I am trying to use the ssd_inception_v2_coco pre-trained model from the TensorFlow API, training it on a single-class dataset with transfer learning. I trained the net for around 20k steps (total loss around 1) and, using the checkpoint data, created inference_graph.pb and used it in the detection code.
To my surprise, when I tested the net with the training data the graph is not able to detect even 1 out of 11 cases (0/11). I am lost in finding the issue.
What might be the mistake?
P.S.: I am not able to run train.py and eval.py at the same time due to memory issues, so I don't have precision figures from TensorBoard.
Has anyone faced similar kind of issue?
For learning purposes, I am trying to implement a CNN from scratch, but the results do not seem to improve over random guessing. I know this is not the best approach on home hardware, and following course.fast.ai I have obtained much better results via transfer learning, but for a deeper understanding I would like to see, at least in theory, how one could do it otherwise.
Testing on CIFAR-10 posed no issues - a small CNN trained from scratch in a matter of minutes with an error of less than 0.5%.
However, when testing against the Cats vs. Dogs Kaggle dataset, the results did not budge from 50% accuracy. The architecture is basically a copy of AlexNet, including the non-state-of-the-art choices (large filters, histogram equalization, Nesterov-SGD optimizer). For more details, I put the code in a notebook on GitHub:
https://github.com/mspinaci/deep-learning-examples/blob/master/dogs_vs_cats_with_AlexNet.ipynb
(I also tried different architectures, more VGG-like and using Adam optimizer, but the result was the same; the reason why I followed the structure above was to match as closely as possible the Caffe procedure described here:
https://github.com/adilmoujahid/deeplearning-cats-dogs-tutorial
and that seems to converge quickly enough, according to the author's description here: http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/).
I was expecting some fitting to happen quickly, possibly flattening out due to the many suboptimal choices made (e.g. small dataset, no data augmentation). Instead, I saw no increment at all, as the notebook shows.
So I thought that maybe I was simply overestimating my GPU and patience, and that the model was too complicated even to overfit my data in a few hours (I ran 70 epochs, each time roughly 360 batches of 64 images). Therefore I tried to overfit as hard as I could, running these other models:
https://github.com/mspinaci/deep-learning-examples/blob/master/Ridiculously%20overfitting%20models...%20or%20maybe%20not.ipynb
The purely linear model started showing some overfitting: around 53.5% training accuracy vs 52% validation accuracy (which I guess is thus my best result). That matched my expectations. However, to try to overfit as hard as I could, the second model is a simple 2-layer feedforward neural network, without any regularization, that I trained on just 2000 images with a batch size of up to 500. I was expecting the NN to overfit wildly, quickly getting to 100% training accuracy (after all, it has 77M parameters for 2k pictures!). Instead, nothing happened, and the accuracy flattened out at 50% quickly enough.
Any tip about why none of the "multi-layer" models seems able to pick up any feature (be it "true" or a result of overfitting) would be very much appreciated!
Note on versions etc.: the notebooks were run on Python 2.7, Keras 2.0.8, Theano 0.9.0. The OS is Windows 10, and the GPU is a GeForce GTX 960M, which is not very powerful but should be sufficient for basic tasks.
The problem:
I have a model that I would like to train with independent datasets. Afterwards, I would like to extract the weights of each model (the model is the same for each instance but trained using a different dataset) and finally compute an average of these weights. Basically, my intention is to mimic TensorFlow running on multiple devices and then average the weights so that they can be used by one model.
My solution:
I added this model multiple times to TensorFlow and am currently training each copy separately with its own dataset, but this uses GBs of memory, and I am wondering if there is a better way to do this.
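For reference, the averaging step itself can be sketched in a few lines of NumPy. This assumes each model's weights are available as a list of arrays in the same layer order and with the same shapes; the function name and the example values are hypothetical.

```python
import numpy as np


def average_weights(weight_sets):
    """Average several models' weights layer by layer.

    weight_sets: one entry per trained model; each entry is a list of NumPy
    arrays (one per layer), in the same order and with the same shapes.
    """
    # zip(*weight_sets) groups the i-th layer of every model together.
    return [np.mean(layer_group, axis=0) for layer_group in zip(*weight_sets)]


# Two toy "models", each with a 2x2 kernel and a bias vector.
model_a = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([0.0, 1.0])]
model_b = [np.array([[3.0, 2.0], [1.0, 0.0]]), np.array([2.0, 1.0])]
averaged = average_weights([model_a, model_b])
print(averaged)
```

In Keras, `model.get_weights()` returns a list in this form and `model.set_weights(averaged)` loads it back. One way to reduce memory use would be to train a single model instance on one dataset at a time, saving its weights to disk after each run, so that only one copy of the graph is in memory at any point.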
One possible solution is to fine-tune your network starting from the weights of a similar network trained on a similar dataset (e.g., if your dataset consists of images, you can use AlexNet weights). Don't worry if your network does not have the same architecture: you can load the weights of only as many layers as you need with the 'load_with_skip' function of
https://github.com/joelthchao/tensorflow-finetune-flickr-style/blob/master/network.py
Fine-tuning takes much less time than training a network from scratch.
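The idea of loading only some layers can be sketched as follows. This is not the code from the linked repository; it is a simplified illustration that assumes both the pre-trained and the target weights are plain dicts mapping layer names to NumPy arrays, and the layer names and shapes are made up.

```python
import numpy as np


def load_with_skip_sketch(pretrained, target, skip_layers):
    """Copy pretrained weights into `target` for every layer that is not skipped.

    Both `pretrained` and `target` are dicts mapping layer name -> NumPy array
    (a hypothetical format chosen for this example). Returns the loaded names.
    """
    loaded = []
    for name, weights in pretrained.items():
        # Skip layers the caller wants to train from scratch,
        # and layers that do not exist in the target network.
        if name in skip_layers or name not in target:
            continue
        # Skip layers whose shapes do not match the target network.
        if target[name].shape != weights.shape:
            continue
        target[name] = weights.copy()
        loaded.append(name)
    return loaded


# The final layer differs (1000 classes vs 2), so it is skipped and keeps
# its fresh initialization; the convolutional layer is reused.
pretrained = {"conv1": np.ones((3, 3)), "fc8": np.ones((4, 1000))}
target = {"conv1": np.zeros((3, 3)), "fc8": np.zeros((4, 2))}
loaded = load_with_skip_sketch(pretrained, target, skip_layers=["fc8"])
print(loaded)
```

Typically the skipped layers are the last, task-specific ones, which are then trained on the new dataset while the reused layers are fine-tuned.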