Tensorflow ConcatOp Error with Object Detection API - python

I'm following the TensorFlow Object Detection API instructions and trying to train an existing object detection model ("faster_rcnn_resnet101_coco") on my own dataset, which has 50 classes.
For my dataset, I created:
TFRecord files (separate ones for training, evaluation, and testing)
labelmap.pbtxt
Next, I edited model.config, changing only model-faster_rcnn-num_classes (90 -> 50, the number of classes in my dataset), train_config-batch_size (1 -> 10), train_config-num_steps (200000 -> 100), train_input_reader-tf_record_input_reader-input_path (to the path where the TFRecord resides), and train_input_reader-label_map_path (to the path where labelmap.pbtxt resides).
Finally, I ran the command
python train.py \
--logtostderr \
--pipeline_config_path="PATH WHERE CONFIG FILE RESIDES" \
--train_dir="PATH WHERE MODEL DIRECTORY RESIDES"
And I got the error below:
InvalidArgumentError (see above for traceback): ConcatOp : Dimensions
of inputs should match: shape[0] = [1,890,600,3] vs. shape[1] =
[1,766,600,3] [[Node: concat_1 = ConcatV2[N=10, T=DT_FLOAT,
Tidx=DT_INT32,
_device="/job:localhost/replica:0/task:0/cpu:0"](Preprocessor/sub, Preprocessor_1/sub, Preprocessor_2/sub, Preprocessor_3/sub,
Preprocessor_4/sub, Preprocessor_5/sub, Preprocessor_6/sub,
Preprocessor_7/sub, Preprocessor_8/sub, Preprocessor_9/sub,
concat_1/axis)]]
The dimensions of the input images don't match, so this may be caused by the raw image data not being resized.
But as far as I know, the model automatically resizes the input images for training (doesn't it?).
So I'm stuck on this issue.
If there is a solution, I'd appreciate an answer.
Thanks.
UPDATE
When I changed my batch_size field from 10 back to 1 (the original value), it seems to train without any problem... but I don't understand why...

TaeWoo is right, you have to set batch_size to 1 in order to train Faster RCNN.
This is because FRCNN uses a keep_aspect_ratio_resizer, which in turn means that if you have images of different sizes, they will also be different sizes after the preprocessing. This practically makes batching impossible, since a batch tensor has a shape [num_batch, height, width, channels]. You can see this is a problem when (height, width) differ from one example to the next.
This is in contrast to the SSD model, which uses a "normal" fixed-size resizer, i.e. regardless of the input image, all preprocessed examples end up having the same size, which allows them to be batched together.
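To see where mismatched shapes like the ones in the question can come from, here is a minimal pure-Python sketch of the shape arithmetic behind an aspect-ratio-preserving resize. The function names and the rounding are illustrative, not the API's actual code, and min_dimension=600 / max_dimension=1024 are assumed to be the values in the sample Faster RCNN config. The two input sizes are hypothetical.

```python
def keep_aspect_ratio_shape(height, width, min_dimension=600, max_dimension=1024):
    """Return the (height, width) after an aspect-ratio-preserving resize."""
    # Scale so that the shorter side becomes min_dimension ...
    scale = min_dimension / min(height, width)
    # ... unless that would push the longer side past max_dimension.
    if max(height, width) * scale > max_dimension:
        scale = max_dimension / max(height, width)
    return round(height * scale), round(width * scale)


def fixed_shape(height, width, target_height=300, target_width=300):
    """A fixed-size resize ignores the input size entirely."""
    return target_height, target_width


# Two hypothetical portrait images with different aspect ratios.
print(keep_aspect_ratio_shape(1780, 1200))  # (890, 600)
print(keep_aspect_ratio_shape(1532, 1200))  # (766, 600)
print(fixed_shape(1780, 1200), fixed_shape(1532, 1200))  # (300, 300) (300, 300)
```

The first two results differ in height, so they cannot be concatenated along the batch axis; the fixed-size results can.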
Now, if you have images of different sizes, you practically have two ways of using batching:
Use Faster RCNN and pad your images to a common size, either once before training or on the fly as a preprocessing step. I'd suggest the former, since on-the-fly padding seems to slow down training a lot.
Use SSD, but make sure that your objects are not affected too much by the distortion. This shouldn't be a very big problem, since it works as a form of data augmentation.
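The padding option can be sketched with NumPy as follows. This is only an illustration of the idea, not code from the Object Detection API: pad_to_common_shape is a hypothetical helper, and padding at the bottom and right with zeros is an assumption.

```python
import numpy as np


def pad_to_common_shape(images, pad_value=0.0):
    """Pad HxWxC arrays at the bottom/right to the largest H and W, then stack."""
    max_h = max(img.shape[0] for img in images)
    max_w = max(img.shape[1] for img in images)
    padded = []
    for img in images:
        pad_h = max_h - img.shape[0]
        pad_w = max_w - img.shape[1]
        padded.append(
            np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)),
                   mode="constant", constant_values=pad_value))
    return np.stack(padded)


# Shapes taken from the error message in the question.
images = [np.ones((890, 600, 3), dtype=np.float32),
          np.ones((766, 600, 3), dtype=np.float32)]
batch = pad_to_common_shape(images)
print(batch.shape)  # (2, 890, 600, 3)
```

Padding only at the bottom and right leaves absolute pixel coordinates of the boxes unchanged; if the annotations are stored as normalized coordinates, they would need to be rescaled to the padded size.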

I had the same problem. Setting batch_size=1 does indeed seem to solve it, but I am not sure whether this will have any effect on the accuracy of the model. Would love to get the TF team's answer to this.

I had a similar problem that I want to share; maybe it will help others in similar situations. I changed the SSD object detection net to find bboxes with a fifth variable, which is an angle. The problem was that we inserted an empty list into the angle variable of the bounding box, and then I got an error in the tf.concat operation:
Dimensions of inputs should match: shape[0] = [1,43] vs. shape[4] = [1,0]
(shape[0] changed if I reran the session, but shape[4] stayed the same, [1,0])
I fixed the problem by fixing my TFRecord so that the list of angles has the same length as the other bbox variables (xmin, xmax, ymin, ymax).
Hope it helps someone; it took me a whole day to track down the problem.
Regards,
Alon
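A check along these lines, run before each example is written to the TFRecord, would catch the mismatch described above. It is a minimal sketch: check_box_fields is a hypothetical helper and the dictionary layout is an assumption; only the field names (xmin, xmax, ymin, ymax and the angle) come from the answer.

```python
def check_box_fields(boxes):
    """Return the number of boxes, or raise if the per-box lists differ in length."""
    lengths = {name: len(values) for name, values in boxes.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("per-box fields differ in length: %r" % lengths)
    return next(iter(lengths.values()), 0)


good = {
    "xmin": [0.1, 0.4], "xmax": [0.3, 0.9],
    "ymin": [0.2, 0.5], "ymax": [0.6, 0.8],
    "angle": [0.0, 1.57],
}
bad = dict(good, angle=[])  # the empty angle list that caused the ConcatOp error

print(check_box_fields(good))  # 2
try:
    check_box_fields(bad)
except ValueError as err:
    print("rejected:", err)
```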

Related

How to train different size of image using cnn? [duplicate]

I am trying to train my model which classifies images.
The problem I have is that the images have different sizes. How should I format my images or model architecture?
You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
If you are using fully connected units though, you're in for trouble: here you have a fixed number of learned weights your network has to work with, so varying inputs would require a varying number of weights, and that's not possible.
If that is your problem, here are some things you can do:
Don't care about squashing the images. A network might learn to make sense of the content anyway; does scale and perspective mean anything to the content anyway?
Center-crop the images to a specific size. If you fear you're losing data, do multiple crops and use these to augment your input data, so that the original image will be split into N different images of correct size.
Pad the images with a solid color to a squared size, then resize.
Do a combination of that.
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
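The crop and pad options can be sketched with NumPy as below. The helper names, the zero fill value, and the 240x320 example size are assumptions made for illustration; a final resize to the network's input size would still follow the padding step.

```python
import numpy as np


def center_crop(image, size):
    """Cut a size x size window from the middle of an HxWxC array."""
    h, w = image.shape[:2]
    if h < size or w < size:
        raise ValueError("image is smaller than the crop size")
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]


def pad_to_square(image, fill=0):
    """Pad the shorter side symmetrically with a solid value."""
    h, w = image.shape[:2]
    side = max(h, w)
    top = (side - h) // 2
    left = (side - w) // 2
    pad = ((top, side - h - top), (left, side - w - left), (0, 0))
    return np.pad(image, pad, mode="constant", constant_values=fill)


image = np.ones((240, 320, 3), dtype=np.uint8)
print(center_crop(image, 224).shape)  # (224, 224, 3)
print(pad_to_square(image).shape)     # (320, 320, 3)
```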
If you need some ideas, have a look at the Images section of the TensorFlow documentation; there are pieces like resize_image_with_crop_or_pad that take away most of the work.
As for just not caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
    distorted_image,
    lambda x, method: tf.image.resize_images(x, [height, width], method=method),
    num_cases=num_resize_cases)
They're totally aware of it and do it anyway.
Depending on how far you want or need to go, there is actually a paper called Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition that handles inputs of arbitrary sizes by processing them in a very special way.
Try making a spatial pyramid pooling layer. Then put it after your last convolution layer so that the FC layers always get constant-dimensional vectors as input. During training, train on the images from the entire dataset using one particular image size for one epoch. Then, for the next epoch, switch to a different image size and continue training.
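The idea behind such a layer can be illustrated with NumPy. This is a rough sketch of the pooling arithmetic only, not the paper's implementation: the pyramid levels (1, 2, 4), the bin boundaries, and the function names are assumptions.

```python
import numpy as np


def bin_edges(length, bins):
    """(start, stop) index pairs: `bins` windows that together cover range(length)."""
    return [((i * length) // bins, -(-(i + 1) * length // bins))
            for i in range(bins)]


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool an HxWxC map into a vector of length C * sum(n * n for n in levels)."""
    height, width, _ = feature_map.shape
    pooled = []
    for n in levels:
        for r0, r1 in bin_edges(height, n):
            for c0, c1 in bin_edges(width, n):
                # One C-dimensional vector per spatial bin.
                pooled.append(feature_map[r0:r1, c0:c1].max(axis=(0, 1)))
    return np.concatenate(pooled)


# Feature maps of different spatial sizes give vectors of the same length.
small = spatial_pyramid_pool(np.random.rand(13, 13, 8))
large = spatial_pyramid_pool(np.random.rand(10, 17, 8))
print(small.shape, large.shape)  # (168,) (168,)
```

With 8 channels and 1 + 4 + 16 = 21 bins, both outputs have 168 values, which is what lets a fully connected layer follow regardless of the input size.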

How to use pre-trained models without classes in Tensorflow?

I'm trying to use a pretrained network such as tf.keras.applications.ResNet50 but I have two problems:
I just want to obtain the top embedding layers at the end of the network, because I don't want to do any image classification. So due to this there is no need for a classes number I think.
tf.keras.applications.ResNet50 takes a default parameter 'classes=1000'
Is there a way to omit this parameter?
My input pictures are 128*128*1 pixels and not 224*224*3
What is the best way to fix my input data shape?
My goal is to make a triplet loss network with the output of a resnet network.
Thanks a lot!
ResNet50 has a parameter include_top exactly for that purpose: set it to False to skip the last fully connected layer. (It then outputs a feature map with 2048 channels; passing pooling='avg' as well reduces it to a feature vector of length 2048.)
The best way to change your image size is to resample the images, e.g. using the dedicated function tf.image.resize_images (tf.image.resize in TensorFlow 2).
Also, I did not notice at first that your input images have only one channel, thanks @Daniel. I suggest you build your 3-channel grayscale image on the GPU (not on the host using numpy) to avoid tripling your data transfer to GPU memory, using tf.tile:
im3 = tf.tile(im, (1, 1, 1, 3))
As a complement to the other answer: you will also need to make your images have three channels. Although technically not the best input for ResNet, it is the easiest solution (changing the ResNet model is an option too, if you go into the source code and change the input shape yourself).
Use numpy to pack images in three channels:
images3ch = np.concatenate([images,images,images], axis=-1)

What is the proper way of doing inference on a batch of images in Tensorflow for performance purposes

What I am trying to accomplish is doing inference in Tensorflow with a batch of images at a time instead of a single image. I am wondering what is the most appropriate way of handling processing of multiple images to speed up inference?
Doing inference on a single image is easy and is covered in most tutorials, but what I have not seen yet is doing it in a batch-like style.
Here's what I am currently using at a high level:
pl = tf.placeholder(tf.float32)
...
sess.run([boxes, confs], feed_dict={pl: image})
I would appreciate any input on this.
Depending on how your model is designed, you can just feed an array of images to pl. The first dimension of your outputs then corresponds to the index of your image in the batch.
Many tensor ops have an implementation for multiple examples in a batch. There are some exceptions though, for example tf.image.decode_jpeg. In this case, you will have to rewrite your network, using tf.map_fn, for example.
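The batching itself can be sketched with NumPy. Here toy_model is a hypothetical stand-in for the sess.run call in the question, and the image size is made up; the point is only that equally sized images are stacked along a new first axis and the outputs are indexed the same way. For this to work with a placeholder, its shape would need a leading batch dimension.

```python
import numpy as np


def toy_model(batch):
    """Hypothetical stand-in for the network: one score per image in the batch."""
    return batch.mean(axis=(1, 2, 3))


# Three equally sized images, filled with different constant values.
images = [np.full((240, 320, 3), fill, dtype=np.float32)
          for fill in (0.0, 0.5, 1.0)]
batch = np.stack(images)   # shape (3, 240, 320, 3)
scores = toy_model(batch)  # shape (3,), scores[i] belongs to images[i]
for index, score in enumerate(scores):
    print(index, float(score))
```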

Any better approach to solve MemoryError?

I am working on deep learning. I am using Keras with the TensorFlow backend, and I have 36980 images to train on. I want to use VGG16, so I resized all of them to 224*224. So the size of the training array is around 22GB (36980*224*224*3*4 bytes). When I try to load this amount of data into a numpy array, Python raises a MemoryError.
I have thought of splitting the training set into 10 pieces and train my model on only one of such piece at a time. Is there any better approach to solve this problem? I am using Python 3 (64 bit version).
N.B.
To get a good accuracy, I need as large image as possible, so I can't resize them to a smaller size. Moreover, it's necessary to use RGB images here.
No fit_generator() solution please. A model trained using fit_generator() behaves abnormally while predicting, at least as far as I have seen.
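For reference, the size estimate in the question can be reproduced with plain arithmetic. The uint8 line is an added observation, not something from the question: it assumes the pixels could be kept as one byte per value in memory and converted to float32 only per batch.

```python
def array_bytes(num_images, height, width, channels, bytes_per_value):
    """Memory needed for a dense image array, in bytes."""
    return num_images * height * width * channels * bytes_per_value


float32_bytes = array_bytes(36980, 224, 224, 3, 4)  # 4 bytes per float32 value
uint8_bytes = array_bytes(36980, 224, 224, 3, 1)    # 1 byte per uint8 value

print(float32_bytes / 1e9)  # roughly 22.27 (GB)
print(uint8_bytes / 1e9)    # roughly 5.57 (GB)
```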

Issue with tensorflow shape with b/w images

I'm a tensorflow beginner so please bear with me.
Right now I am trying to modify an existing Python program for a CNN that creates a super-resolution image. The code can be found here if you're interested: https://github.com/pinae/Superresolution
The input tensor has the shape <5,240,320,3>, 5 being batch size, 240 and 320 the size of the image(s) and 3 being the number of channels (RGB). I want to modify this program for black and white images, so just 1 channel -> <5,240,320,1>
First, I convert the testing and validation images to b/w:
image = image.convert('L')
Each image then gets written into an array, and this is where my issue starts. The array has the shape <240,320>. The arrays of 5 images get written into a list, which is handed over to tensorflow.
Tensorflow expects a <5,240,320,1> tensor, but the list of images has the shape <5,240,320>, so one dimension is missing. I tried adding a dimension with np.expand_dims and the like, but with no success.
input_batches = np.expand_dims(input_batches, axis=-1)
Why does the index of channels of a tensorflow placeholder seem to start at 1 while the index of resolution starts at 0?
I'm sure there will be many more issues down the road like adjusting the filters but this is where I'm stuck now.
If you have a tensor of shape [5,240,320] you can reshape it to be [5,240,320,1] with this one command
correctSizedTensor = tf.reshape( wrongSizedTensor, [5,240,320,1] )
You need to understand the code. I believe the problem may be deeper rooted. With the limited information, I assume you are using network.py. In there, we see this in line 16:
self.inputs = tf.placeholder(
tf.float32, [batch_size, dimensions[1], dimensions[0], 3], name='input_images'
)
The depth dimension is already hard coded as 3. You would have to edit that as well, amongst possibly many other things.
As a caveat, most super-resolution CNNs use small patch sizes, definitely not (240, 320). I also suspect it will be hard to converge, as the batch size is small.
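The shape handling on the NumPy side can be checked in isolation. This sketch uses zero-filled arrays in place of the real images; it only shows that stacking and expand_dims give the shape from the question, which will still be rejected as long as the placeholder's last dimension is hard-coded to 3.

```python
import numpy as np

# Five single-channel images, each 240x320, as produced after convert('L').
gray_images = [np.zeros((240, 320), dtype=np.float32) for _ in range(5)]

batch = np.stack(gray_images)           # (5, 240, 320)
batch = np.expand_dims(batch, axis=-1)  # (5, 240, 320, 1)
print(batch.shape)
```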
