Is there a way to build the following graph in TensorFlow?
Load N images (N can vary for each set) using TF queues and TF image readers.
Process these images to a fixed size and prepare batches.
Feed these batches through the CNN model.
Some questions/info:
1) I am trying to build the data loading part in TF instead of using Python functions and feed_dict. I guess TF data loading can train the model faster than Python and feed_dict. Is that right?
2) Building the graph for small N (N<5) is easy: define a separate node for each of the N images and process each one. (This works.)
3) Can I use TF "while_loop" to build this functionality for reading N images?
4) Does Keras support such functionality?
Thanks for your suggestions.
I just did this last week! It was awesome; I learned a ton about TensorFlow using things like tf.map_fn and tf.cond. And it worked.
This week I just refactored my code to eliminate it all, because it was a bad idea.
Issues I ran into:
Doing preprocessing in TensorFlow is messy to debug. Proper TDD will definitely help here, but it's still not going to be particularly pretty or easy to debug.
You should be offloading the preprocessing to the CPU and leaving the GPU (assuming you're using one) to do the training. A better approach is to have a queue and load it from a thread/class dedicated to your preprocessing task. Doing the work in numpy/scikit/scikit-image is also going to be easier to configure and test.
I thought I was so smart, corralling all my code into a single model. But the complexity of the preprocessing made my model really hard to iterate on, and the code became rigid quickly. For example, when I added my test set evaluation, the preprocessing requirement was slightly different. Suddenly I had to add large sections of conditional code to my model, and it got ugly quick.
That being said, my preprocessing steps were maybe more complex than yours. If you're sticking to the simple built-in image preprocessing steps, it might still be easier for you to go with this approach.
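To make the "dedicated preprocessing thread" idea concrete, here is a minimal sketch using only the Python standard library. The preprocess function and the fake three-pixel "images" are placeholders I made up for illustration; in a real pipeline you would load and resize images there (e.g. with scikit-image) and hand each batch to your training step.

```python
import queue
import threading


def preprocess(raw):
    # Hypothetical stand-in for real work (decode, resize, crop, ...).
    return [value / 255.0 for value in raw]


def producer(raw_items, out_queue, batch_size):
    """Preprocess items on a worker thread and enqueue batches."""
    batch = []
    for raw in raw_items:
        batch.append(preprocess(raw))
        if len(batch) == batch_size:
            out_queue.put(batch)
            batch = []
    if batch:  # final, possibly smaller, batch
        out_queue.put(batch)
    out_queue.put(None)  # sentinel: no more data


def batches(raw_items, batch_size=4, max_prefetch=8):
    """Yield preprocessed batches while a background thread fills the queue."""
    q = queue.Queue(maxsize=max_prefetch)
    worker = threading.Thread(
        target=producer, args=(raw_items, q, batch_size), daemon=True
    )
    worker.start()
    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch
    worker.join()


raw_images = [[0, 128, 255]] * 10  # ten fake "images" of three pixels each
batch_sizes = [len(b) for b in batches(raw_images, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

The bounded queue (max_prefetch) keeps the worker from running arbitrarily far ahead of training, and because preprocess is plain Python it can be unit-tested without building a graph.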
To answer your questions specifically:
1) Queues won't give any benefit over feed_dict that I know of. You still have to move data from a TF queue on the CPU to GPU memory each iteration, the same as feed_dict does. Watch this thread if you care about that topic; GPU queues are coming: https://github.com/tensorflow/tensorflow/issues/7679
2) You should just dequeue_many from the queue and process the images as a batch. If you need to do something to each individual image, use tf.map_fn, which will remove the first dimension and pass individual 3D images to your specified function. But heed my warning above when you go this route: you'll probably be happier just doing this in a separate thread.
3) Already answered in #2: use tf.map_fn to iterate over multiple images in a batch. It's pretty easy to use, actually.
4) I don't know Keras.
Related
When I run fit() with multiprocessing=True, I always get a deadlock and the following warning:
WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
How do I run it properly?
Since the warning mentions tf.data, I wonder whether converting my data to this format will make multiprocessing work. What specifically is meant, and how do I convert it?
My dataset (reproducible):
import numpy as np
Input_shape, labels = (20, 4), 6
LEN_X, LEN_Y = 20000, 3000
train_X, train_Y = np.asarray([np.random.random(Input_shape) for _ in range(LEN_X)]), np.random.random((LEN_X, labels))
validation_X, validation_Y = np.asarray([np.random.random(Input_shape) for _ in range(LEN_Y)]), np.random.random((LEN_Y, labels))
sampleW = np.random.random((LEN_X, 1))
Multiprocessing doesn't accelerate the model itself; it only accelerates data loading. And data loading delay is not a problem when all your data is already in memory.
You could still use multiprocessing, but you must make sure that the underlying dataset is thread-safe, and you have to craft the data pipeline carefully. That is quite time-consuming, so I suggest you speed up the model itself instead.
For that, you should look into:
Changing the activations of all layers except the last to ReLU.
Tweaking the batch size. (The optimal number depends on your hardware, and is almost always less than or equal to 32.)
Using batch normalization to speed up convergence.
Using a higher learning rate (be careful not to overdo this step).
If you need faster convolutions, consider using Kaggle notebooks or vast.ai for GPU-enabled computations.
Last but not least, try using a simpler, smaller model.
Comment down here if you have any additional questions.
Cheers.
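On the "how do I convert it" part of the question: a minimal sketch of wrapping in-memory NumPy arrays in a tf.data pipeline, assuming TensorFlow 2.x (tf.data.AUTOTUNE needs a reasonably recent 2.x release). The array names mirror the question, but the smaller n_train, the flat sample-weight shape and the batch size of 32 are my own choices for illustration.

```python
import numpy as np
import tensorflow as tf

input_shape, labels = (20, 4), 6
n_train = 2000  # smaller than the question's 20000 to keep the sketch quick

train_x = np.random.random((n_train, *input_shape)).astype("float32")
train_y = np.random.random((n_train, labels)).astype("float32")
# Assumption: one scalar weight per sample, stored as a flat vector.
sample_w = np.random.random((n_train,)).astype("float32")

batch_size = 32
train_ds = (
    tf.data.Dataset.from_tensor_slices((train_x, train_y, sample_w))
    .shuffle(buffer_size=n_train)  # reshuffles on every pass by default
    .batch(batch_size)
    .prefetch(tf.data.AUTOTUNE)    # overlap input preparation with training
)

# Pull one batch to check what the model would receive.
first_x, first_y, first_w = next(iter(train_ds))
print(first_x.shape, first_y.shape, first_w.shape)
```

A dataset that yields (inputs, targets, sample_weights) tuples can be passed straight to fit(), e.g. model.fit(train_ds, epochs=10), with no multiprocessing flag. As the answer says, with data already in memory this mostly removes the warning rather than making training faster.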
Obscure question, I suppose, but I have a PyTorch model that I've just built, and it's failing to convert to CoreML because ONNX has added Gather ops. The complete model is actually an amalgamation of two separate models, intended to improve performance by keeping the processing on the GPU/Metal for as long as possible.
Building this "composite" model required me to create a couple of slices, of the form x = y[:, 0], and I'm wondering if these might be the reason for the Gather ops?
I do realize I can create a custom layer, but I've just been through a horrible fiasco with custom layers in CoreML that wasted many, many hours and got me nowhere, so I'm trying to find another way around the problem.
If finding a way around those slices would prevent ONNX from adding Gather I'd be willing to search for a solution.
Any thoughts appreciated.
I am new to TensorFlow and I am working on distributing test images to multiple GPUs. I have read a lot of Stack Overflow answers and GitHub examples, and I think there might be two ways to do that.
1) Using tf.FIFOQueue() to feed images to each GPU. However, many answers advise against queues (because of the new tf.data API), and they have some issues (https://github.com/tensorflow/tensorflow/issues/8061).
2) Using the tf.data API. I am not sure whether this API supports GPUs. According to this issue (https://github.com/tensorflow/tensorflow/issues/13610), it seems that an input pipeline built with the tf.data API cannot feed GPUs yet.
Distributed TensorFlow is not something I am considering (our model and server scale are not that large).
I would appreciate any advice.
Use tf.data. The tf.data API is meant to replace almost all of the functionality of queues, and it makes everything easier and more performant.
It can also feed data to the GPU. The second issue you link to only says that the preprocessing itself will not run on the GPU: the data is processed on the CPU and then sent to your multiple GPUs.
I came across TensorFlow's Deep MNIST for Experts and wanted to adapt it for more efficient use on GPUs. Since feed_dict seems to be incredibly slow, I implemented an input pipeline using tf.train.shuffle_batch and a FIFOQueue to feed data into the model.
Here's a Gist with the stock implementation of the TensorFlow guide and here's a Gist with my attempt at an optimized implementation.
In the example on the TensorFlow page, the accuracy approaches 1 pretty quickly, after a few thousand iterations. In my code, however, which is the same model aside from the queue implementation, the accuracy oscillates between ~0.05 and ~0.15. Further, the loss reaches about 2.3 after a couple hundred iterations and doesn't decrease much further than that.
Another noteworthy point: when I compare the original batch with the batch used in subsequent iterations, they appear to be equivalent. Perhaps the issue lies in my queuing/dequeuing, but I'm not really sure how to fix it. If anyone sees any issues with my implementation, some pointers would be greatly appreciated!
Found the solution. Turns out tf.train.shuffle_batch implicitly implements a RandomShuffleQueue. Loading the results of tf.train.shuffle_batch into a FIFOQueue presumably caused the FIFOQueue to not update the input batch, while the labels were being updated because they weren't being passed into the FIFOQueue. Removing the FIFOQueue entirely solved my issue.
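The numbers in the question fit this diagnosis. A loss stuck near 2.3 and an accuracy around 0.1 are what a 10-class classifier produces when its inputs carry no information about the labels, which is what you would expect if images and labels are out of step. A quick check, assuming the loss is softmax cross-entropy with the natural logarithm:

```python
import math

num_classes = 10  # MNIST digits

# Cross-entropy of a model that assigns probability 1/10 to every class.
chance_loss = -math.log(1.0 / num_classes)
# Expected accuracy of guessing uniformly at random.
chance_accuracy = 1.0 / num_classes

print(f"chance loss     = {chance_loss:.4f}")      # 2.3026
print(f"chance accuracy = {chance_accuracy:.2f}")  # 0.10
```

So if training plateaus at ln(10) on a 10-class problem, it is worth checking the input pipeline before suspecting the model.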
In TensorFlow, it seems that preprocessing can be done either at training time, when the batch is created from raw images (or data), or beforehand, so that the images are already static. Given that, theoretically, the preprocessing should take roughly equal time (if done on the same hardware), is there any practical disadvantage in doing data preprocessing (or even data augmentation) before training rather than during training in real time?
As a side question, could data augmentation even be done in TensorFlow if it was not done during training?
"Is there any practical disadvantage in doing data preprocessing (or even data augmentation) before training than during training in real-time?"
Yes, there are advantages (+++) and disadvantages (---):
Preprocessing before training:
--- preprocessed samples need to be stored: disk space consumption* (1)
--- only a "finite" amount of samples can be generated
+++ no runtime during training
---... but samples always need to be read from storage, i.e. storage (disk) I/O may become the bottleneck
--- not flexible: changing dataset/augmentation requires generating a new augmented dataset
+++ for Tensorflow: Easily work on numpy.ndarray or other data formats with any high-level image API (open-cv, PIL, ...) to do augmentation, or even use any other language/tool you like.
Preprocessing during training ("real-time"):
+++ infinite amount of samples can be generated (as it is generated on-the-fly)
+++ flexible: changing dataset/augmentation only requires changing code
+++ if dataset fits in memory, no disk I/O needed for data after reading once
--- adds runtime to your training* (2)
--- for Tensorflow: Building the preprocessing as part of the graph requires working with Tensors and restricts usage of APIs working on ndarrays or other formats.* (3)
Some specific aspects discussed in detail:
(1): Reproducing experiments "with the same data" is fairly straightforward with a dataset generated before training. However, this can be solved even more elegantly by storing a seed for real-time data generation.
(2): Training runtime for preprocessing: There are ways to keep an expensive preprocessing pipeline from getting in the way of your actual training. Tensorflow itself recommends filling queues from many (CPU) threads so that data generation can independently keep up with GPU data consumption. You can read more about this in the input pipeline performance guide.
(3): Data augmentation in tensorflow
"As a side question, could data augmentation even be done in Tensorflow if was not done during (I think you mean) before training?"
Yes, TensorFlow offers some functionality for augmentation. For value augmentation of scalar/vector (or higher-dimensional) data, you can easily build something yourself with tf.multiply or other basic math ops. For image data, there are several ops implemented (see tf.image and tf.contrib.image), which should cover a lot of augmentation needs.
There are off-the-shelf preprocessing examples on GitHub, one of which is used and described in the CNN tutorial (cifar10).
Personally, I would always try to use real-time preprocessing, as generating (potentially huge) datasets feels clunky. But preprocessing beforehand is perfectly viable; I've seen it done many times and (as you see above) it definitely has its advantages.
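To illustrate point (1), here is a minimal sketch of seed-based reproducibility for on-the-fly augmentation, using only the standard library. The augment function (a random flip plus brightness jitter on a list of numbers) is a made-up placeholder; the point is only that the whole random stream is derived from one stored seed.

```python
import random


def augment(sample, rng):
    """Hypothetical augmentation: random flip plus brightness jitter."""
    out = list(sample)
    if rng.random() < 0.5:
        out.reverse()
    scale = rng.uniform(0.8, 1.2)
    return [value * scale for value in out]


def augmented_stream(dataset, seed, epochs):
    """Generate augmented samples on the fly; a fixed seed replays the stream."""
    rng = random.Random(seed)  # private generator, independent of global state
    for _ in range(epochs):
        for sample in dataset:
            yield augment(sample, rng)


dataset = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
run_a = list(augmented_stream(dataset, seed=42, epochs=3))
run_b = list(augmented_stream(dataset, seed=42, epochs=3))
print(run_a == run_b)  # True: same seed, same augmented samples
```

Storing the seed alongside the experiment config then plays the role that a saved augmented dataset would otherwise play, without the disk space.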
I have been wondering the same thing and have been disappointed with my during-training-time image processing performance. It has taken me a while to appreciate quite how big an overhead the image manipulation can be.
I am going to make myself a nice fat juicy preprocessed/augmented data file. Run it overnight and then come in the next day and be twice as productive!
I am using a single-GPU machine, and it seems obvious to me that piece-by-piece model building is the way to go. However, the workflow maths may look different if you have different hardware. For example, on my MacBook Pro, TensorFlow was slow (on CPU) and image processing was blindingly fast because it was automatically done on the laptop's GPU. Now that I have moved to a proper GPU machine, TensorFlow is running 20x faster and the image processing is the bottleneck.
Just work out how long your augmentation/preprocessing is going to take, work out how often you are going to reuse it and then do the maths.
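"Doing the maths" can be as simple as the sketch below. All figures are made up for illustration, and it assumes the on-the-fly preprocessing time is pure overhead, i.e. not hidden by running it in parallel with training.

```python
def offline_total(preprocess_once_s, epoch_s, epochs):
    """Total time when the augmented dataset is generated once, up front."""
    return preprocess_once_s + epoch_s * epochs


def online_total(epoch_s, overhead_per_epoch_s, epochs):
    """Total time when preprocessing runs inside every training epoch."""
    return (epoch_s + overhead_per_epoch_s) * epochs


def break_even_epochs(preprocess_once_s, overhead_per_epoch_s):
    """Number of epochs after which preprocessing up front pays for itself."""
    return preprocess_once_s / overhead_per_epoch_s


# Made-up figures: 8 h to preprocess once, 30 min of extra work per epoch
# when done on the fly, 1 h per epoch of pure training.
once, overhead, epoch = 8 * 3600, 30 * 60, 3600

print(break_even_epochs(once, overhead))         # 16.0
print(offline_total(once, epoch, 40) / 3600)     # 48.0 (hours)
print(online_total(epoch, overhead, 40) / 3600)  # 60.0 (hours)
```

Count every epoch across all the experiments that will reuse the preprocessed file, not just one training run; that is usually what tips the balance.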