Training in tensor flow cpu version too slow

Training in tensor flow cpu version too slow - python

I have installed Tensorflow cpu version.I have only few images as dataset and I am training on a machine with 4GB ram and Core i5 3340m 2.70GHZ with batch size 1 and it is still extremely slow.the size of all images is same (200X185 i think).Will it train like this ? kindly tell me how can I speed up this process?
Training porcess

If your network is deep, it could take a long time to train your network using CPU as it is not optimized like GPU for calculations.
I would suggest you to get a graphic card, even a old version of graphic card can significantly improve the performance (it could be like 100x faster).

Let's put some numbers here. You are dealing with images with a size of 200x185. Do you realize we are talking about 37000 features right? If we deal with gray levels. If we deal with RGB multiply that by 3. How many images are you using for training? Keep also in mind that SGD (Stochastic Gradient Descent, mini-batch size = 1) tend to be very slow for big datasets... Give us some numbers. How many training images and what is "slow". How much time for one epoch. Something else: programming languages, library (tensorflow, etc.), optimizer, etc. would help us in judging if your code is "slow" and can it be made faster.

batch size is another param affect training time: higher size will help reduce time each epoch, but will require more epoch to have the same effiency like size=1
And if your network is deep (using CNN, etc), you should run on GPU

Related

Allowing Tensorflow to Use both GPU and Physical System Memory

For a project I'm working on, I am using an altered version of Mask RCNN to train a model that will find objects in an image. These images are relatively small, about 300 x 200 pixels, and I train them for a relatively long time, around 100 epochs.
However, my main question relates to the batch size and how Tensorflow allocates memory on the gpu for the validation stage per epoch. I want to increase my batch size to help better smooth out the validation curve, as well as increase the accuracy of the overall model. However, if I increase my batch size to drastically, I get a OOM: GPU out-of-momory and keras_scratch_graph error. I'm currently working with two NVIDIA Quadro P5000s that have 16GB of vram each. having about 3 images per GPU, I can have a max batch size of 6 before it errors out. I've looked around and most people either say to just decrease the batch size, which I would prefer not to do, or enable GPU growth, which I couldn't get to work either. I could decrease the complexity of my model to decrease the size of tensors that are being evaluated, but I don't want to risk it as it could cause my accuracy to decrease, or loss to increase.
Is there a way that I can offset some images onto my physical systems memory, or am I purely limited to the amount of ram I have available on my GPU? Are their any more compact or robust methods out there that could solve this issue?

transfer learning with bad computer

I want to use transfer learning to classify image. That my first try using transfer learning. I curently use VGG16 model. Since my data are very different from image used for the original training model, theory told me I should train many layers, potentialy including hidden layers.
My computer has 8GO ram, using i5 2.40 Hz no gpu. My data set is small (3000 images), but data are stored as matrix in python memory, not saved in a folder. Almost all my RAM is takent by those images
Original VGG16 model has 130 million parameters. If I only take weight of hiden layer, and create 2 new (and small, size 512 and 256) fully connected layer at the end, I still have 15M parameter to train, for a total of 30m parameter.
I actually use image size of 224*224 like vgg16 input
My computer need 1H30 for 1 epoch. At 10 epoch I have a bad accuracy (50% vs 90% with conv net from scratch).
My question:
computer crash after X epoch, I don't know why. Could it be RAM problem? Since when vgg started to train for 1 epoch, and other epoch are just weight adjustement, other epoch should not impact memory?
Should I unfreeze input layer to use image of reduced dimension to reduce memory problem and training time? It'll not affect too much conv net performance?
Is it normal to need 1h30 to compute 1 epoch with 15M trainable parameter? Since I still need to find optimal number of layer to unfreeze, shape of new fully connected layer, learning rate, otpimizer... it's look impossible to me to optimise a transfer learning model with my curent commputing ressources in a decent amount of time
Do you have any tips to for transfer learning?
thanks

No specific tips for transfer learning, but if you are lacking on computing power, it might be helpful to consider transitioning to cloud resources. AWS, Google cloud, Azure or other services are available at really reasonable prices.
Most of them also provide some free resources, which can be enough for small ML projects or student tasks.
Notably:
Google colab provides a free GPU for a limited time
AWS provides ~ 250 hours of training per month on sagemaker
Azure notebooks also provides some free (but limited) computing power
Most of these services also provide free general compute power, on which you can also run ML tasks, but that might require some additional manual tweaking.

Why is Keras LSTM on CPU three times faster than GPU?

I use this notebook from Kaggle to run LSTM neural network.
I had started training of neural network and I saw that it is too slow. It is almost three times slower than CPU training.
CPU perfomance: 8 min per epoch;
GPU perfomance: 26 min per epoch.
After this I decided to find answer in this question on Stackoverflow and I applied a CuDNNLSTM (which runs only on GPU) instead of LSTM.
Hence, GPU perfomance became only 1 min per epoch and accuracy of model decreased on 3%.
Questions:
1) Does somebody know why GPU works slower than CPU in the classic LSTM layer? I do not understand why this happens.
2) Why when I use CuDNNLSTM instead of LSTM, training become much more faster and the accuracy of the model decrease?
P.S.:
My CPU: Intel Core i7-7700 Processor (8M Cache, up to 4.20 GHz)
My GPU: nVidia GeForce GTX 1050 Ti (4 GB)

Guessing it's just a different, better implementation and, if the implementation is different, you shouldn't expect identical results.
In general, efficiently implementing an algorithm on a GPU is hard and getting maximum performance requires architecture-specific implementations. Therefore, it wouldn't be surprising if an implementation specific to Nvidia's GPUs had enhanced performance versus a general implementation for GPUs. It also wouldn't be surprising that Nvidia would sink significantly more resources into accelerating their code for their GPUs versus than would a team working on a general CNN implementation.
The other possibility is that the data type used on the backend has changed from double- to single- or even half-precision float. The smaller data types mean you can crunch more numbers faster at the cost of accuracy. For NN applications this is often acceptable because no individual number needs to be especially accurate for the net to produce acceptable results.

I had a similar problem today and found two things that may be helpful to others (this is a regression problem on a data set with ~2.1MM rows, running on a machine with 4 P100 GPUs):
Using the CuDNNLSTM layer instead of the LSTM layer on a GPU machine reduced the fit time from ~13500 seconds to ~400 seconds per epoch.
Increasing the batch size (~500 to ~4700) reduced it to ~130 seconds per epoch.
Reducing the batch size has increase loss and val loss, so you'll need to make a decision about the trade offs you want to make.

In Keras, the fast LSTM implementation with CuDNN.
model.add(CuDNNLSTM(units, input_shape=(len(X_train), len(X_train[0])), return_sequences=True))
It can only be run on the GPU with the TensorFlow backend.

Training changing input size RNN on Tensorflow

I want to train an RNN with different input size of sentence X, without padding. The logic used for this is that I am using Global Variables and for every step, I take an example, write the forward propagation i.e. build the graph, run the optimizer and then repeat the step again with another example. The program is extremely slow as compared to the numpy implementation of the same thing where I have implemented forward and backward propagation and using the same logic as above. The numpy implementation takes a few seconds while Tensorflow is extremely slow. Can running the same thing on GPU will be useful or I am doing some logical mistake ?

As a general guideline, GPU boosts performance only if you have calculation intensive code and little data transfer. In other words, if you train your model one instance at a time (or on small batch sizes) the overhead for data transfer to/from GPU can even make your code run slower! But if you feed in a good chunk of samples, then GPU will definitely boost your code.

Cannot train a CNN (or other neural network) to fit (or even over-fit) on Keras + Theano

For learning purposes, I am trying to implement a CNN from scratch, but the results do not seem to improve from random guessing. I know this is not the best approach on home hardware, and following course.fast.ai I have obtained much better results via transfer learning, but for a deeper understanding I would like to see, at least in theory, how one could do it otherwise.
Testing on CIFAR-10 posed no issues - a small CNN trained from scratch in a matter of minutes with an error of less than 0.5%.
However, when trying to test against the Cats vs. Dogs Kaggle dataset, the results did not bulge from 50% accuracy. The architecture is basically a copy of AlexNet, including the non-state-of-the-art choices (large filters, histogram equalization, Nesterov-SGD optimizer). For more details, I put the code in a notebook on GitHub:
https://github.com/mspinaci/deep-learning-examples/blob/master/dogs_vs_cats_with_AlexNet.ipynb
(I also tried different architectures, more VGG-like and using Adam optimizer, but the result was the same; the reason why I followed the structure above was to match as closely as possible the Caffe procedure described here:
https://github.com/adilmoujahid/deeplearning-cats-dogs-tutorial
and that seems to converge quickly enough, according to the author's description here: http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/).
I was expecting some fitting to happen quickly, possibly flattening out due to the many suboptimal choices made (e.g. small dataset, no data augmentation). Instead, I saw no increment at all, as the notebook shows.
So I thought that maybe I was simply overestimating my GPU and patience, and that the model was too complicated even to overfit my data in a few hours (I ran 70 epochs, each time roughly 360 batches of 64 images). Therefore I tried to overfit as hard as I could, running these other models:
https://github.com/mspinaci/deep-learning-examples/blob/master/Ridiculously%20overfitting%20models...%20or%20maybe%20not.ipynb
The purely linear model started showing some overfit - around 53.5% training accuracy vs 52% validation accuracy (which I guess is thus my best result). That followed my expectations. However, to try and overfit as hard as I could, the second model is a simple 2 layers feedforward neural network, without any regularization, that I trained on just 2000 images with batch size up to 500. I was expecting the NN to overfit wildly, quickly getting to 100% train accuracy (after all it has 77M parameters for 2k pictures!). Instead, nothing happened, and the accuracy flattened to 50% quickly enough.
Any tip about why none of the "multi-layer" models seems able to pick any feature (be it "true" or out of overfitting) would be very much appreciated!
Note on versions etc: the notebooks were run on Python 2.7, Keras 2.0.8, Theano 0.9.0. The OS is Windows 10, and the GPU is a not-so-powerful, but that should be sufficient for basic tasks, GeForce GTX 960M.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.