I am new to Theano and deep learning. I am running my experiments in Theano, but I would like to reduce the time I spend per epoch by doing data augmentation directly on the GPU.
Unfortunately I cannot use PyCUDA, so I would like to know if it is possible to do basic data augmentation using Theano, for example translation or rotation of images. In the meantime I am doing it with scipy functions on the CPU using NumPy, but it is quite slow.
If the data augmentation is part of your computation graph, and can be executed on GPU, it will naturally be executed on the GPU. So the question narrows down to "is it possible to do common data augmentation tasks using Theano tensor operations on the GPU".
If the transformations you want to apply are just translations, you can use theano.tensor.roll followed by some masking. If you want rotations as well, take a look at this implementation of a spatial transformer network. In particular, look at the _transform function: it takes as input a matrix theta holding a 2x3 affine transformation per sample (the left 2x2 block is the rotation, the right 2x1 column is the translation) together with the actual samples, and applies the rotation and translation to those samples. I haven't confirmed that it is optimized for the GPU (i.e. the bottleneck of that function might run on the CPU, which would make it unsuitable for your use case), but it's a good starting point.
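For the translation case, here is a minimal sketch of the roll-plus-mask idea (the shift amount and tensor layout are just assumptions for illustration):

import theano
import theano.tensor as T

# Translate a batch of images horizontally by `shift` pixels using roll,
# then zero out the columns that wrapped around so the shift is not circular.
images = T.tensor4('images')   # assumed layout: (batch, channels, height, width)
shift = 3                      # assumed translation in pixels

shifted = T.roll(images, shift, axis=3)
mask = T.ones_like(shifted)
mask = T.set_subtensor(mask[:, :, :, :shift], 0)  # zero the wrapped-around columns
translated = shifted * mask

translate_fn = theano.function([images], translated)

If Theano is configured to use the GPU, this expression gets compiled for the GPU along with the rest of your graph.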
I am starting with deep learning using Keras and TensorFlow. At the very first stage I am stuck with a doubt: I use tf.contrib.layers.flatten (API 1.8) for flattening an image (which could be multichannel as well).
How is this different from using the flatten function from numpy?
How does this affect training? I can see that tf.contrib.layers.flatten is taking longer than numpy's flatten. Is it doing something more?
This is a very similar question, but there the accepted answer includes Theano and does not exactly resolve my doubts.
Example:
Let's say I have training data of shape (10000, 2, 96, 96). Now I need the output to be of shape (10000, 18432). I can do this using tensorflow flatten or by using numpy flatten like
X_reshaped = X_train.reshape(*X_train.shape[:1], -1)
What difference does it make in training, and which is the best practice?
The biggest difference between np.flatten and tf.layers.flatten (or tf.contrib.layers.flatten) is that numpy operations are applicable only to static nd arrays, while tensorflow operations can work with dynamic tensors. Dynamic in this case means that the exact shape will be known only at runtime (either training or testing).
So my recommendation is pretty simple:
If the input data is a static numpy array, e.g. in pre-processing, use np.flatten. This avoids unnecessary overhead and returns a numpy array as well.
If the data is already a tensor, use any of the flatten ops provided by tensorflow. Between those, tf.layers.flatten is the better choice, since the tf.layers API is more stable than tf.contrib.*.
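As a minimal sketch (assuming TensorFlow 1.x and the shapes from the question), the difference shows up when the batch size is not known until runtime:

import numpy as np
import tensorflow as tf

# Static numpy array: the shape is fully known, flatten each sample with reshape
X_train = np.zeros((10000, 2, 96, 96), dtype=np.float32)
X_flat = X_train.reshape(X_train.shape[0], -1)   # (10000, 18432)

# Dynamic tensor: the batch size is only known at runtime
x = tf.placeholder(tf.float32, shape=(None, 2, 96, 96))
x_flat = tf.layers.flatten(x)                    # shape (?, 18432), batch axis preserved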
Use numpy directly on your data when no neural network is involved. This is for preprocessing and postprocessing only.
Use TF or Keras layers inside models if this operation is needed for some reason in the model. This will ensure model connectivity and proper backpropagation.
Models are symbolic graphs meant to create neural networks that can be trained. Backpropagation will only work properly when you have a graph connected from input to output.
If you don't intend to create a network, don't use a TF layer. If your goal is just to flatten an array, you don't need a neural network.
Now, if inside a model you need to change the format of the data without losing connectivity and backpropagation, then go for the flatten layer.
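As an illustration (a minimal sketch using tf.keras; the layer sizes are made up), a Flatten layer inside a model keeps the graph connected so backpropagation works through it:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(2, 96, 96)),   # -> (batch, 18432)
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')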
The flatten function in numpy does a complete array flattening, meaning that you end up with a single axis of data (1 dimension only).
For example,
import numpy as np
a = np.arange(20).reshape((5,4))
print(a)
print(a.flatten().shape)
In the previous example, you end up with a 1-d array of 20 elements.
In tensorflow, the flatten layer (tf.layers.flatten) preserves the batch axis (axis 0). In the previous example, with tensorflow, you would still have a shape of (5,4).
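For comparison, a small sketch of the tensorflow side (assuming TensorFlow 1.x):

import tensorflow as tf

b = tf.zeros((5, 4))
print(tf.layers.flatten(b).shape)   # (5, 4): already 2-D, nothing to collapse
c = tf.zeros((5, 2, 2))
print(tf.layers.flatten(c).shape)   # (5, 4): batch axis kept, remaining axes collapsed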
In any case, there is no effect on training if you use flatten in an equivalent way. However, you should avoid using numpy when working with tensorflow, since almost all numpy operations have tensorflow counterparts. Tensorflow and numpy rely on different runtime libraries, and combining both can be inefficient at runtime.
Moreover, avoid using contrib package layers, when they already exist in the main package (use tf.layers.flatten instead of tf.contrib.layers.flatten).
For a more general performance comparison between numpy and tensorflow, have a look at this question: Tensorflow vs. Numpy Performance
Difference
When you use tensorflow flatten, it gets added as an operation (op) in the graph. It can operate only on tensors. Numpy on the other hand works on actual numpy arrays. The usage is completely different.
Usage
You would use tensorflow op if this is an operation in the training process such as resizing before feeding to the next layer.
You would use the numpy op when you want to operate on the actual value at that time, like reshaping for calculating accuracy at the end of a training step.
So if you had a task of
tensor A -> reshape -> matrix_mul
If you use tensorflow for the reshape, you can run the matrix_mul directly from a single session call.
If you use numpy, however, you'd have to run the operation in two stages (two session calls):
You calculate tensor A.
You reshape it in numpy.
You run the matrix_mul by "feeding" in the reshaped array.
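A minimal sketch of both variants (TensorFlow 1.x session API; the shapes are made up for illustration):

import numpy as np
import tensorflow as tf

A = tf.constant(np.arange(12, dtype=np.float32).reshape(3, 2, 2))   # "tensor A"
W = tf.constant(np.ones((4, 1), dtype=np.float32))

# In-graph: tensor A -> reshape -> matrix_mul, evaluated in a single session call
out_graph = tf.matmul(tf.reshape(A, (3, 4)), W)

# With numpy: fetch A, reshape it outside the graph, feed it back for the matmul
A_ph = tf.placeholder(tf.float32, shape=(3, 4))
out_fed = tf.matmul(A_ph, W)

with tf.Session() as sess:
    result_in_graph = sess.run(out_graph)                        # one call
    a_val = sess.run(A)                                          # call 1: compute A
    result_fed = sess.run(out_fed, {A_ph: a_val.reshape(3, 4)})  # call 2: feed reshaped array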
Performance
I haven't benchmarked anything, but I'd say that for a standalone reshape operation numpy would be faster (ignoring the GPU), whereas in a process where the reshape is an intermediate op, tensorflow should be faster.
I want to train an RNN on sentences X of different input sizes, without padding. The logic I use is: I keep global variables and, for every step, I take one example, write the forward propagation (i.e. build the graph), run the optimizer, and then repeat with another example. The program is extremely slow compared to my numpy implementation of the same thing, where I implemented forward and backward propagation myself using the same logic. The numpy implementation takes a few seconds, while Tensorflow is extremely slow. Would running the same thing on a GPU be useful, or am I making a logical mistake?
As a general guideline, the GPU boosts performance only if you have computation-intensive code and little data transfer. In other words, if you train your model one instance at a time (or on small batch sizes), the overhead of data transfer to/from the GPU can even make your code run slower! But if you feed in a good chunk of samples, the GPU will definitely speed up your code.
When preparing the training set for neural network training, I find two possible ways.
The traditional way: calculate the mean over the whole training set, and subtract this fixed mean value from each image before sending it to the network. Handle the standard deviation in a similar way.
I find that tensorflow provides a function, tf.image.per_image_standardization, that does normalization on a single image.
I wonder which way is more appropriate?
Both ways are possible and the choice mostly depends on the way you read the data.
Whole-training-set normalization is convenient when you can load the whole dataset at once into a numpy array; e.g., the MNIST dataset is usually loaded fully into memory. This way is also preferable in terms of convergence when the individual images vary significantly: two training images, one mostly white and the other mostly black, will have very different means.
Per-image normalization is convenient when the images are loaded one by one or in small batches, for example from a TFRecord. It's also the only viable option when the dataset is too large to fit in memory. In this case, it's better to organize the input pipeline in tensorflow and transform the image tensors just like other tensors in the graph. I've seen pretty good accuracy with this normalization on CIFAR-10, so it's a viable way, despite the issues stated earlier. Also note that you can reduce the negative effect via batch normalization.
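A minimal sketch of both options (the array below is just a stand-in for a real dataset, and the per-image variant assumes the TF 1.x graph API):

import numpy as np
import tensorflow as tf

# Option 1: whole-training-set normalization, computed once in numpy
X_train = np.random.rand(1000, 96, 96, 3).astype(np.float32)   # stand-in data
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_norm = (X_train - mean) / (std + 1e-8)   # epsilon avoids division by zero

# Option 2: per-image standardization inside the graph, e.g. in an input pipeline
image = tf.placeholder(tf.float32, shape=(96, 96, 3))
standardized = tf.image.per_image_standardization(image)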
Suppose I have a complex matrix manipulation routine and I have implemented it as a Tensorflow graph. Now suppose I want to run this graph against a bunch of data located outside Tensorflow. Suppose I can't represent this data as N-d tensors in Tensorflow and am obliged to feed the matrices one by one.
Can I benefit from the GPU in this situation?
Can I do calculations in parallel? Some sort of par-for or SIMD, so that Tensorflow would build a queue and allocate GPU resources as needed?
In Tensorflow, it seems that preprocessing can be done either during training, when the batch is created from raw images (or data), or beforehand, when the images are already static. Given that, theoretically, the preprocessing should take roughly equal time (if done on the same hardware), is there any practical disadvantage in doing data preprocessing (or even data augmentation) before training rather than during training in real-time?
As a side question, could data augmentation even be done in Tensorflow if it was not done during training?
Is there any practical disadvantage in doing data preprocessing (or even data augmentation) before training rather than during training in real-time?
Yes, there are advantages (+++) and disadvantages (---):
Preprocessing before training:
--- preprocessed samples need to be stored: disk space consumption* (1)
--- only a "finite" amount of samples can be generated
+++ no runtime during training
---... but samples always need to be read from storage, i.e. storage (disk) I/O may become the bottleneck
--- not flexible: changing the dataset/augmentation requires generating a new augmented dataset
+++ for Tensorflow: easily work on numpy.ndarray or other data formats with any high-level image API (OpenCV, PIL, ...) to do augmentation, or even use any other language/tool you like.
Preprocessing during training ("real-time"):
+++ infinite amount of samples can be generated (as it is generated on-the-fly)
+++ flexible: changing dataset/augmentation only requires changing code
+++ if dataset fits in memory, no disk I/O needed for data after reading once
--- adds runtime to your training* (2)
--- for Tensorflow: Building the preprocessing as part of the graph requires working with Tensors and restricts usage of APIs working on ndarrays or other formats.* (3)
Some specific aspects discussed in detail:
(1) Reproducing experiments "with the same data" is kind of straightforward with a dataset generated before training. However, this can be solved (even more elegantly!) by storing a seed for real-time data generation.
(2): Training runtime for preprocessing: There are ways to avoid an expensive preprocessing pipeline to get in the way of your actual training. Tensorflow itself recommends filling Queues with many (CPU-)threads so that data generation can independently keep up with GPU data consumption. You can read more about this in the input pipeline performance guide.
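For illustration, a tf.data-based sketch of this idea (the dataset below is random stand-in data; the important parts are num_parallel_calls and prefetch):

import tensorflow as tf

# Stand-in dataset: 1000 random 96x96x3 "images"
images = tf.data.Dataset.from_tensor_slices(tf.random_uniform((1000, 96, 96, 3)))

def preprocess(image):
    # any expensive per-image work goes here; it runs on CPU threads
    return tf.image.resize_images(image, (64, 64))

# several CPU threads prepare the next batches while the GPU consumes the current one
pipeline = images.map(preprocess, num_parallel_calls=4).batch(32).prefetch(1)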
(3): Data augmentation in tensorflow
As a side question, could data augmentation even be done in Tensorflow if it was not done during (I think you mean: before) training?
Yes, tensorflow offers some functionality to do augmentation. In terms of value augmentation of scalars/vectors (or higher-dimensional data), you can easily build something yourself with tf.multiply or other basic math ops. For image data, there are several ops implemented (see tf.image and tf.contrib.image), which should cover a lot of augmentation needs.
There are off-the-shelf preprocessing examples on github, one of which is used and described in the CNN tutorial (cifar10).
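A tiny hedged sketch of what such image augmentation can look like with tf.image ops (the specific deltas/ranges are arbitrary):

import tensorflow as tf

def augment(image):
    # image: a single HxWxC float32 image tensor
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return image

You would typically apply such a function to every image in your input pipeline.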
Personally, I would always try to use real-time preprocessing, as generating (potentially huge) datasets feels clunky. But it is perfectly viable; I've seen it done many times and (as you can see above) it definitely has its advantages.
I have been wondering the same thing and have been disappointed with my during-training image processing performance. It has taken me a while to appreciate just how big an overhead image manipulation can be.
I am going to make myself a nice fat juicy preprocessed/augmented data file. Run it overnight and then come in the next day and be twice as productive!
I am using a single GPU machine, and it seems obvious to me that piece-by-piece model building is the way to go. However, the workflow maths may look different if you have different hardware. For example, on my MacBook Pro, tensorflow was slow (on the CPU) and image processing was blindingly fast because it was automatically done on the laptop's GPU. Now that I have moved to a proper GPU machine, tensorflow is running 20x faster and the image processing is the bottleneck.
Just work out how long your augmentation/preprocessing is going to take, work out how often you are going to reuse it, and then do the maths.