I am running MTCNN (implemented in TensorFlow) for face recognition on the GPU.
MTCNN uses three models, PNet, RNet, and ONet, and between them some steps run in NumPy.
For example, after getting the output from PNet, it applies numpy.transpose to the output to compute boxes, and then passes these boxes to RNet.
So I suppose only the PNet, RNet, and ONet models run on the GPU, while the NumPy steps run on the CPU. That means the output has to be copied from GPU memory to main memory, which wastes quite a lot of time.
Is my guess right?
To improve performance, I want to put all the MTCNN calculations on the GPU.
Could anyone give me some ideas or an example?
First of all, the memcpy between the GPU and CPU may not be that expensive; I would measure it first.
If you use TensorFlow-GPU, all operations are automatically placed on the GPU whenever possible. Of course, this will not work for the NumPy steps. So the way to go is to replace the NumPy routines with their TensorFlow counterparts. TF has a fairly rich tensor manipulation library whose API is similar to NumPy's. Compare, for instance, np.transpose to tf.transpose.
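To make the replacement concrete, here is a minimal sketch of moving one such step into the graph (the tensor name, shape, and permutation are illustrative, not the actual MTCNN code):

```python
import numpy as np
import tensorflow as tf

# NumPy version: runs on the CPU, so the PNet output has to be copied
# from GPU memory to main memory first.
def boxes_numpy(pnet_out):
    return np.transpose(pnet_out, (0, 2, 1, 3))

# TensorFlow version: stays inside the graph, so it can be placed on the
# GPU together with PNet/RNet/ONet and no intermediate copy is needed.
pnet_out = tf.placeholder(tf.float32, shape=[None, None, None, 4])
boxes = tf.transpose(pnet_out, perm=[0, 2, 1, 3])

with tf.Session() as sess:
    dummy = np.zeros((1, 3, 5, 4), dtype=np.float32)
    print(sess.run(boxes, feed_dict={pnet_out: dummy}).shape)  # (1, 5, 3, 4)
```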
Related
I use torch to build a model. In the training loop, I need to compute a dictionary of values, which happens on the GPU. Then, every few iterations, I need to use that dictionary, which lives on the GPU, to perform some other tasks on the CPU, so there is a back and forth between the two. Currently, I use torch.save and torch.load to save and load the dictionary, which is very slow and not thread-safe.
I am almost sure that there is a better way to accomplish what I am trying to do. I understand that copying data to/from the GPU causes a slowdown, but I am looking for a strategy that does not involve the disk.
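For reference, here is a minimal sketch of the round trip being described (the dictionary contents are made up for illustration); the disk-based variant mirrors the torch.save / torch.load approach, and tensor.detach().cpu() is one possible in-memory way to get the same values onto the CPU:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Values produced on the GPU inside the training loop (illustrative).
stats = {"loss": torch.rand(1, device=device),
         "grad_norm": torch.rand(1, device=device)}

# Disk-based round trip as described above (slow, not thread-safe).
torch.save(stats, "stats.pt")
stats_cpu = torch.load("stats.pt", map_location="cpu")

# In-memory alternative: copy each tensor straight to host memory.
stats_cpu = {k: v.detach().cpu() for k, v in stats.items()}
```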
I want to train an RNN on sentences X of different input sizes, without padding. The logic I use is: with global variables, for every step I take one example, write the forward propagation (i.e. build the graph), run the optimizer, and then repeat with another example. The program is extremely slow compared to a NumPy implementation of the same thing, in which I implemented forward and backward propagation myself using the same logic. The NumPy implementation takes a few seconds, while TensorFlow is extremely slow. Would running the same thing on a GPU help, or am I making a logical mistake?
As a general guideline, a GPU boosts performance only if you have computation-intensive code and little data transfer. In other words, if you train your model one instance at a time (or with small batch sizes), the overhead of transferring data to/from the GPU can even make your code run slower! But if you feed in a good chunk of samples at once, the GPU will definitely speed up your code.
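As an illustration of the difference, here is a rough TF 1.x-style sketch (the model, shapes, and batch size are made up; only the feeding pattern matters):

```python
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 128])
w = tf.Variable(tf.random_normal([128, 10]))
y = tf.matmul(x, w)

data = np.random.rand(4096, 128).astype(np.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # One instance at a time: 4096 host-to-device copies and 4096 tiny
    # kernel launches -- the transfer overhead dominates.
    for row in data:
        sess.run(y, feed_dict={x: row[np.newaxis, :]})

    # Batched: 16 copies, and each kernel has enough work to keep the
    # GPU busy.
    for start in range(0, len(data), 256):
        sess.run(y, feed_dict={x: data[start:start + 256]})
```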
Suppose I have a complex matrix manipulation routine and I have implemented it as a TensorFlow graph. Now suppose I want to run this graph against a bunch of data located outside TensorFlow. Suppose I can't represent this data as N-d tensors in TensorFlow and am obliged to feed the matrices one by one.
Can I benefit from the GPU in this situation?
Can I do the calculations in parallel? Some sort of par-for or SIMD, so that TensorFlow builds a queue and allocates GPU resources as needed?
Is there a way to build the following graph in TensorFlow:
Load some N images (N can vary for each set) using TF Queues and TF Image Readers.
Process these images to get fixed-size images and prepare batches.
Feed these batches through the CNN model.
Some questions/info:
I am trying to build the data-loading part in TF instead of using Python functions and feed_dict. I am guessing that TF data loading can train the model faster than Python and feed_dict. Is that right?
Building the graph for small N (N < 5) is easy: define dedicated nodes for each of the N images and process them. (This works.)
Can I use TF "while_loop" to build such functionality for reading N images?
Does Keras support such functionality?
Thanks for your suggestions.
I just did this last week! It was awesome; I learned a ton about TensorFlow using things like tf.map_fn and tf.cond. And it worked.
This week I just refactored my code to eliminate it all, because it was a bad idea.
Issues I ran into:
Doing preprocessing in TensorFlow is messy to debug. Proper TDD will definitely help you here, but it's still not going to be particularly pretty or easy to debug.
You should be offloading the preprocessing to the CPU and leaving the GPU (assuming you're using one) to do the training. A better approach is to just have a queue and load it from a thread/class dedicated to your preprocessing task; there is a sketch of this pattern after this list. Doing the work in numpy/scikit/scikit-image is going to be easier to configure and test.
I thought I was so smart, corralling all my code into a single model. But the complexity of the preprocessing meant my model was really hard to iterate on, and it became rigid code quickly. For example, when I added my test-set evaluation, the preprocessing requirements were slightly different; suddenly I had to add large sections of conditional code to my model, and it got ugly quickly.
That being said, my preprocessing steps were maybe more complex than yours. If you're sticking to simple image preprocessing steps, it might still be easier for you to go this route.
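Here is a minimal sketch of that separate-thread pattern (the preprocess function, shapes, and batch size are hypothetical; the training loop just consumes ready-made batches from the queue via feed_dict):

```python
import queue
import threading

import numpy as np

batch_queue = queue.Queue(maxsize=8)

def preprocess(image):
    # Hypothetical numpy/scikit-image work: resizing, normalization, etc.
    return image.astype(np.float32) / 255.0

def producer(image_batches):
    # Runs on the CPU in its own thread, leaving the GPU free for training.
    for batch in image_batches:
        batch_queue.put(np.stack([preprocess(img) for img in batch]))

image_batches = [[np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
                  for _ in range(32)] for _ in range(10)]
threading.Thread(target=producer, args=(image_batches,), daemon=True).start()

for _ in range(10):
    batch = batch_queue.get()  # ready-to-use float32 batch, shape (32, 64, 64, 3)
    # sess.run(train_op, feed_dict={images: batch})  # training step goes here
```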
To answer your questions specifically:
Queues won't give any benefit over feed_dict that I know of. You still have the problem of moving data from a TF queue on the CPU to GPU memory on each iteration, the same as feed_dict does. Watch this thread if you care about that topic; GPU queues are coming: https://github.com/tensorflow/tensorflow/issues/7679
You should just dequeue_many from the queue and process the images as a batch. If you need to do something to each individual image, use tf.map_fn, which strips the first dimension and passes individual 3D images to your specified function. But heed my warning above if you go this route: you'll probably be happier just doing this in a separate thread.
Already answered in #2: use tf.map_fn to iterate over multiple images in a batch. It's pretty easy to use, actually; see the sketch after this list.
I don't know Keras.
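A minimal sketch of the dequeue_many + tf.map_fn pattern from points 2 and 3 (the queue setup, image size, and per-image ops are illustrative):

```python
import tensorflow as tf

# Illustrative queue of fixed-size uint8 images, filled elsewhere by
# enqueue ops fed from TF image readers.
image_queue = tf.FIFOQueue(capacity=128, dtypes=[tf.uint8],
                           shapes=[[256, 256, 3]])

batch = image_queue.dequeue_many(32)  # shape [32, 256, 256, 3], uint8

def per_image(img):
    # img is a single 3-D image; tf.map_fn strips the batch dimension.
    img = tf.image.random_flip_left_right(img)
    return tf.image.resize_images(tf.to_float(img), [64, 64])

processed = tf.map_fn(per_image, batch, dtype=tf.float32)  # [32, 64, 64, 3]
# `processed` can now be fed through the CNN model.
```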
I am new to Theano and deep learning. I am running my experiments in Theano, but I would like to reduce the time I spend per epoch by doing data augmentation directly on the GPU.
Unfortunately I cannot use PyCUDA, so I would like to know whether it is possible to do basic data augmentation, for example translation or rotation of images, using Theano. At the moment I am using SciPy functions on the CPU with NumPy, but it is quite slow.
If the data augmentation is part of your computation graph and can be executed on the GPU, it will naturally be executed on the GPU. So the question narrows down to: is it possible to do common data augmentation tasks using Theano tensor operations on the GPU?
If the transformations you want to apply are just translations, you can use theano.tensor.roll followed by some masking. If you want rotations as well, take a look at this implementation of a spatial transformer network. In particular, look at the _transform function: it takes as input a matrix theta containing a 2x3 transformation (the left 2x2 block is the rotation and the right 2x1 column is the translation), one per sample, along with the actual samples, and applies the rotation and translation to those samples. I didn't confirm that what it does is optimized for the GPU (i.e. it could be that the bottleneck of that function is executed on the CPU, which would make it inappropriate for your use case), but it's a good starting point.
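Here is a minimal sketch of the roll-plus-masking idea for a horizontal translation (the 4-D batch layout and the shift amount are assumed for illustration):

```python
import numpy as np
import theano
import theano.tensor as T

images = T.tensor4("images")  # (batch, channels, height, width)
shift = 5                     # illustrative horizontal shift in pixels

# Roll along the width axis, then zero out the columns that wrapped around.
rolled = T.roll(images, shift, axis=3)
mask = T.set_subtensor(T.zeros_like(images)[:, :, :, shift:], 1.0)
translated = rolled * mask

translate = theano.function([images], translated)

batch = np.random.rand(2, 3, 32, 32).astype(theano.config.floatX)
print(translate(batch).shape)  # (2, 3, 32, 32)
```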