Suppose I have a complex matrix manipulation routine and I have implemented it as a TensorFlow graph. Now suppose I want to run this graph against a bunch of data located outside TensorFlow. Suppose I can't represent this data as N-d tensors in TensorFlow and am obliged to feed the matrices one by one.
Can I benefit from the GPU in this situation?
Can I do the calculations in parallel? Some sort of par-for or SIMD, so that TensorFlow maintains a queue and allocates GPU resources as needed?
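For concreteness, this is the kind of setup I mean; a minimal sketch where the matmul stands in for my actual routine:

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, None])  # one matrix per run
y = tf.matmul(x, x, transpose_b=True)               # stand-in for the real routine

matrices = [np.random.rand(64, 64).astype(np.float32) for _ in range(100)]

with tf.Session() as sess:
    results = [sess.run(y, feed_dict={x: m}) for m in matrices]  # fed one by one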
I have a saved_model-formatted TensorFlow model over 10GB in size. The computation graph consists of two parts: the upstream part is mainly a sparse embedding matrix, which accounts for most of the size, and the downstream part is a DNN. The first part is memory-intensive and is better run on CPU; the second part contains ops that can be optimized on GPU. The two parts are divided by Gather, SparseSegmentSum, and SparseSegmentMean ops.
My question is: how can I divide these two parts into two saved models, so that I can deploy them separately on CPU devices and GPU devices? What's the best practice on this topic? Is there any implementation example?
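For illustration, one direction I have been considering (untested, and the node name here is hypothetical) is to cut the GraphDef at the boundary ops: extract the upstream subgraph with tf.graph_util.extract_sub_graph, and re-import the full graph with a placeholder mapped over the boundary tensor for the downstream part:

import tensorflow as tf

BOUNDARY = 'embedding/SparseSegmentSum'  # hypothetical boundary node name

graph_def = tf.get_default_graph().as_graph_def()  # graph loaded from the saved_model

# Upstream (CPU) part: everything needed to compute the boundary node.
cpu_part = tf.graph_util.extract_sub_graph(graph_def, [BOUNDARY])

# Downstream (GPU) part: re-import the graph, replacing the boundary
# tensor with a placeholder that the CPU part's output is fed into.
with tf.Graph().as_default():
    boundary_in = tf.placeholder(tf.float32, name='boundary_in')
    tf.import_graph_def(graph_def, input_map={BOUNDARY + ':0': boundary_in}, name='')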
I work on medical imaging, so I need to train massive 3D CNNs that are difficult to fit into one GPU. I wonder if there is a way to split a massive Keras or TensorFlow graph among multiple GPUs such that each GPU only runs a small part of the graph during training and inference. Is this type of distributed training possible with either Keras or TensorFlow?
I have tried using with tf.device('/gpu:#') when building the graph, but I am experiencing memory overflow. The logs seem to indicate the entire graph is still being run on gpu:0.
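For reference, this is the placement pattern I tried, as a minimal sketch with made-up layer sizes (note the forward slash in the device strings):

import tensorflow as tf

with tf.device('/gpu:0'):                                  # first half of the net
    x = tf.placeholder(tf.float32, [None, 32, 32, 32, 1])
    h = tf.layers.conv3d(x, 16, 3, activation=tf.nn.relu)

with tf.device('/gpu:1'):                                  # second half
    h = tf.layers.conv3d(h, 32, 3, activation=tf.nn.relu)
    logits = tf.layers.dense(tf.layers.flatten(h), 2)

# log_device_placement shows which device each op actually landed on.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))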
Now I am running MTCNN (implemented in TensorFlow) for face recognition on the GPU.
MTCNN uses three models, PNet, RNet, and ONet, and between them it runs some steps in NumPy.
For example, when it gets the output from PNet, it does some numpy.transpose calls on the output to get boxes, then passes these boxes to RNet.
So I suppose only the PNet, RNet, and ONet models run on the GPU, while the other NumPy steps run on the CPU, which means the output is copied from GPU memory to main memory each time. That would waste quite a lot of time.
Is my guess right?
To improve performance, I want to put all of the MTCNN calculations on the GPU.
Could anyone give me an idea or an example?
First of all, the memcopy between GPU and CPU may not be that expensive; I would measure this first.
If you use TensorFlow-GPU, all operations will automatically be performed on the GPU whenever possible. Of course, this will not work for the NumPy steps. So the way to go would be to replace the NumPy routines with TensorFlow ones. TF has a quite rich tensor manipulation library whose API is similar to NumPy's. Compare, for instance, np.transpose to tf.transpose.
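For example, the numpy.transpose step from the question can be expressed directly in the graph, so the intermediate result never has to leave the GPU. A minimal sketch, where the random tensor stands in for the PNet output:

import tensorflow as tf

pnet_out = tf.random_normal([1, 12, 12, 4])        # stand-in for the PNet output

# NumPy way: np.transpose(sess.run(pnet_out), (0, 3, 1, 2))
# forces a GPU -> CPU copy, a CPU transpose, and a copy back.

# TensorFlow way: the op stays in the graph and can run on the GPU.
boxes = tf.transpose(pnet_out, perm=[0, 3, 1, 2])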
I am starting with deep learning using Keras and TensorFlow, and at the very first stage I am stuck with a doubt about tf.contrib.layers.flatten (API 1.8) when it is used for flattening an image (which could be multichannel as well).
How is this different from using the flatten function from NumPy?
How does this affect the training? I can see that tf.contrib.layers.flatten is taking longer than NumPy's flatten. Is it doing something more?
This is a very close question, but there the accepted answer includes Theano and does not resolve my doubts exactly.
Example:
Let's say I have training data of shape (10000, 2, 96, 96). Now I need the output to be of shape (10000, 18432). I can do this using TensorFlow's flatten or by using NumPy's flatten like
X_reshaped = X_train.reshape(X_train.shape[0], -1)  # -1 infers 2*96*96 = 18432
What difference does it make in training, and which is the best practice?
The biggest difference between np.flatten and tf.layers.flatten (or tf.contrib.layers.flatten) is that NumPy operations are applicable only to static n-d arrays, while TensorFlow operations can work with dynamic tensors. Dynamic in this case means that the exact shape will be known only at runtime (either training or testing).
So my recommendation is pretty simple:
If the input data is a static NumPy array, e.g. in pre-processing, use np.flatten. This avoids unnecessary overhead and returns a NumPy array as well.
If the data is already a tensor, use any of the flatten ops provided by TensorFlow. Among those, tf.layers.flatten is the better choice, since the tf.layers API is more stable than tf.contrib.*. A sketch of both cases follows below.
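A quick sketch of both cases, using the shapes from the question:

import numpy as np
import tensorflow as tf

# Case 1: static NumPy array in pre-processing -> use NumPy.
X_train = np.random.rand(10000, 2, 96, 96).astype(np.float32)
X_flat = X_train.reshape(X_train.shape[0], -1)           # (10000, 18432)

# Case 2: tensor with a dynamic batch dimension -> use tf.layers.flatten.
x = tf.placeholder(tf.float32, shape=[None, 2, 96, 96])  # batch size unknown
x_flat = tf.layers.flatten(x)                            # shape (?, 18432)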
Use NumPy directly on your data, without the participation of a neural network. This is for preprocessing and postprocessing only.
Use TF or Keras layers inside models if this operation is needed for some reason in the model. This will assure model connectivity and proper backpropagation
Models are symbolic graphs meant to build neural networks that can be trained; backpropagation works properly only when you have a graph connected from input to output.
If you don't intend to create a network, don't use a TF layer. If your goal is just to flatten an array, you don't need a neural network.
Now, if inside a model you need to change the format of the data without losing the connection and backpropagation, then go for the flatten layer, as in the sketch below.
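For instance, a minimal Keras model (the layer sizes are made up) where the Flatten layer keeps the graph connected between the convolutional block and the dense head:

from keras.layers import Conv2D, Dense, Flatten, Input
from keras.models import Model

inp = Input(shape=(96, 96, 2))
h = Conv2D(8, 3, activation='relu')(inp)  # feature maps of shape (94, 94, 8)
h = Flatten()(h)                          # in-graph reshape; gradients flow through
out = Dense(10, activation='softmax')(h)
model = Model(inp, out)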
The flatten function in numpy does a complete array flattening, meaning that you end up with a single axis of data (1 dimension only).
For example,
import numpy as np
a = np.arange(20).reshape((5,4))   # 2-d array of shape (5, 4)
print(a)
print(a.flatten().shape)           # (20,): every axis collapsed into one
In the previous example, you end up with a 1-d array of 20 elements.
In TensorFlow, the flatten layer (tf.layers.flatten) preserves the batch axis (axis 0), so in the previous example the output would still have a shape of (5,4).
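The TensorFlow counterpart of the example above would be (assuming TF 1.x; tf.layers.flatten keeps axis 0 and collapses the rest):

import tensorflow as tf

a = tf.reshape(tf.range(20), (5, 4))  # same (5, 4) data as above
flat = tf.layers.flatten(a)
print(flat.shape)                     # (5, 4): the batch axis is preserved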
In any case, there is no effect on training if you use flatten in an equivalent way. However, you should avoid using NumPy when working with TensorFlow, since almost all NumPy operations have TensorFlow counterparts. TensorFlow and NumPy rely on different runtime libraries, and combining both can be inefficient at runtime.
Moreover, avoid using layers from the contrib package when they already exist in the main package (use tf.layers.flatten instead of tf.contrib.layers.flatten).
For a more general performance comparison between numpy and tensorflow, have a look at this question: Tensorflow vs. Numpy Performance
Difference
When you use the TensorFlow flatten, it gets added as an operation (op) in the graph, and it can operate only on tensors. NumPy, on the other hand, works on actual NumPy arrays. The usage is completely different.
Usage
You would use the TensorFlow op if it is an operation in the training process, such as reshaping before feeding to the next layer.
You would use the NumPy op when you want to operate on an actual value at that time, like reshaping an output for calculating accuracy at the end of a training step.
So if you had a task of
tensor A -> reshape -> matrix_mul
If you use TensorFlow for the reshape, you can run matrix_mul directly from the session.
If you use NumPy, however, you'd have to run the operation in two stages (two session calls), as in the sketch below:
You calculate tensor A (first session call).
You reshape it in NumPy.
Run the matrix_mul by feeding in the reshaped array (second session call).
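A minimal sketch of the two variants (TF 1.x; constants are used so both print the same values):

import tensorflow as tf

A = tf.ones([4, 6])   # tensor A
W = tf.ones([24, 3])  # matrix_mul weights

# In-graph reshape: one session call runs reshape + matrix_mul together.
result = tf.matmul(tf.reshape(A, [1, 24]), W)

# NumPy reshape: two session calls with a feed in between.
fed = tf.placeholder(tf.float32, [1, 24])
out = tf.matmul(fed, W)

with tf.Session() as sess:
    print(sess.run(result))                        # single call
    a_flat = sess.run(A).reshape(1, 24)            # call 1, then CPU reshape
    print(sess.run(out, feed_dict={fed: a_flat}))  # call 2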
Performance
I haven't benchmarked anything, but I'd say that for a standalone reshape operation NumPy would be faster (ignoring the GPU), while in a process where reshape is an intermediate op, TensorFlow should be faster.
I am new to Theano and deep learning. I am running my experiments in Theano, but I would like to reduce the time I spend per epoch by doing data augmentation directly on the GPU.
Unfortunately I cannot use PyCUDA, so I would like to know if it is possible to do basic data augmentation using Theano, for example translation or rotation of images. In the meantime I am doing it on the CPU with SciPy functions and NumPy, but it is quite slow.
If the data augmentation is part of your computation graph and can be executed on the GPU, it will naturally be executed on the GPU. So the question narrows down to: is it possible to do common data augmentation tasks using Theano tensor operations on the GPU?
If the transformations you want to apply are just translations, you can use theano.tensor.roll followed by some masking, as sketched below. If you want rotations as well, take a look at this implementation of a spatial transformer network. In particular, take a look at the _transform function: it takes as input a matrix theta containing one 2x3 affine transformation per sample (the left 2x2 block is the rotation and the right 2x1 column is the translation) along with the actual samples, and applies the rotation and translation to those samples. I didn't confirm that what it does is optimized for the GPU (i.e. it could be that the bottleneck of that function is executed on the CPU, which would make it inappropriate for your use case), but it's a good starting point.
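A minimal sketch of the translation case, assuming images in (batch, channel, row, col) layout; the masking zeroes out the rows that wrap around:

import theano
import theano.tensor as T

images = T.tensor4('images')            # (batch, channel, row, col)
shift = 3                               # translate 3 pixels down

rolled = T.roll(images, shift, axis=2)  # circular shift along the row axis
mask = T.set_subtensor(T.ones_like(rolled)[:, :, :shift, :], 0)
translated = rolled * mask              # zero the wrapped-around rows

translate_fn = theano.function([images], translated)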