How to train a CNN on images of different sizes? [duplicate] - python

I am trying to train a model that classifies images.
The problem I have is that they have different sizes. How should I format my images, or change the model architecture?

You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
If you are using fully connected units though, you're in for trouble: here you have a fixed number of learned weights your network has to work with, so varying inputs would require a varying number of weights - and that's not possible.
If that is your problem, here's some things you can do:
Don't worry about squashing the images to a fixed size. A network might learn to make sense of the content anyway; do scale and perspective mean anything to the content anyway?
Center-crop the images to a specific size. If you fear you're losing data, do multiple crops and use these to augment your input data, so that the original image will be split into N different images of correct size.
Pad the images with a solid color to a squared size, then resize.
Do a combination of these.
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
If you need some ideas, have a look at the Images section of the TensorFlow documentation; there are pieces like resize_image_with_crop_or_pad that take the bigger work off your hands.
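For instance, a minimal example of that crop-or-pad route (using tf.image.resize_with_crop_or_pad, the current TF2 name of the op; the sizes are made up):

import tensorflow as tf

# Center-crops dimensions that are too large and zero-pads dimensions
# that are too small, so any input ends up at the target size.
image = tf.random.uniform([300, 180, 3])  # stand-in for an arbitrarily sized image
fixed = tf.image.resize_with_crop_or_pad(image, 224, 224)
print(fixed.shape)  # (224, 224, 3)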
As for just not caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
    distorted_image,
    lambda x, method: tf.image.resize_images(x, [height, width], method=method),
    num_cases=num_resize_cases)
They're totally aware of it and do it anyway.
Depending on how far you want or need to go, there is actually a paper called Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition that handles inputs of arbitrary sizes by processing them in a very special way.

Try making a spatial pyramid pooling layer and putting it after your last convolution layer, so that the FC layers always get constant-dimensional vectors as input. During training, train on images from the entire dataset using one particular image size for an epoch. Then, for the next epoch, switch to a different image size and continue training.
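A minimal sketch of such a layer in TF/Keras (the class name and the pyramid levels are my own choice; it assumes the incoming feature map is at least 4 pixels on each side):

import tensorflow as tf

class SpatialPyramidPooling(tf.keras.layers.Layer):
    # For each pyramid level n, the feature map is split into an n x n grid
    # and every cell is max-pooled, so the output length is
    # channels * sum(n^2) no matter what spatial size comes in.
    def __init__(self, levels=(1, 2, 4), **kwargs):
        super().__init__(**kwargs)
        self.levels = levels

    def call(self, x):  # x: (batch, height, width, channels)
        shape = tf.shape(x)
        h, w = shape[1], shape[2]
        pooled = []
        for n in self.levels:
            for i in range(n):
                for j in range(n):
                    cell = x[:, h * i // n:h * (i + 1) // n,
                              w * j // n:w * (j + 1) // n, :]
                    pooled.append(tf.reduce_max(cell, axis=[1, 2]))
        return tf.concat(pooled, axis=1)

With levels (1, 2, 4), a feature map with C channels always comes out as a (1 + 4 + 16) * C vector, so the FC layers that follow see a constant input dimension.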

Related

Number of Conv2d Layers and Filters for Small Image Classification Task

If I'm working with a dataset where I have ~100,000 training images and ~20,000 validation images, each of size 32 x 32 x 3, how does the size and the dimensions of my dataset affect the number of Conv2d layers I have in my CNN? My intuition is to use fewer Conv2d layers, 2-3, because any more than 3 layers will be working with parts of the image that are too small to gain relevant data from.
In addition, does it make sense to have layers with a large number of filters, >128? My thought is that when dealing with small images, it doesn't make sense to have a large number of parameters.
Since you have exactly the same input size as the images in CIFAR-10 and CIFAR-100, just have a look at what people have tried there.
In general you can start with something like a ResNet18. Also, I don't quite understand why you say
because any more than 3 layers will be working with parts of the image that are too small to gain relevant data from.
As long as you don't downsample, using something like max pooling or a conv with padding 1 and stride 2, the size of 32x32 will stay the same and only the number of channels will change, depending on the network.
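To illustrate the point, here is a toy Keras stack (my own sketch, not from the answer): with stride 1 and 'same' padding, the 32x32 spatial size is preserved through every layer and only the channel count changes.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu'),
])
model.summary()  # every conv output is (None, 32, 32, channels)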
Designing networks almost always comes down to looking at what other people did and what worked for them, and starting from there. You almost never want to do it from scratch on your own, since the iteration cycles are just too long, and the models released by researchers at Google, Facebook, etc. had way more resources behind them than you will ever have to find something good.

How to tackle different Image dimensions for classification

I am working on a problem where I have to classify images into different groups. I am a beginner working with Keras and a simple Sequential model. How should I tackle the problem of images with different dimensions in the code below? E.g. some images have dimensions 210 x 158 x 3 while others have 210 x 60 x 3, etc. Please suggest.
model.add(Dense(100, input_dim=?, activation="sigmoid"))
model.add(Dense(100, input_dim=?, activation="sigmoid"))
For simple classification models in Keras, the input should always have the same size. Since feedforward neural networks consist of an input layer, one or more hidden layers and an output layer, with all nodes connected, you always need an input at each input node. Moreover, the shape and the other hyperparameters of the network are static, so you can't change the number of inputs, and therefore every image fed to one neural network must have the same size.
The best practice for your case would be to either downsize all Images to the size of your smallest Image, or upsize all Images to the size of your largest Image.
Downsizing
With downsizing you actively delete pixels from your image, including the information contained in them. This can lead to overfitting, but it decreases the computational time, too.
Upsizing
With upsizing you add pixels to your image without adding information. This increases the computational time, but you keep the information of each image.
For a good start I would suggest you try downsizing your images to the smallest one. This is common practice in research as well [1]. One library to do so is OpenCV; for implementation details, please refer to the many questions on Stack Overflow:
Python - resize image
Opencv-Python-Resizing image
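A minimal sketch of the downsizing route (the file name is hypothetical; note that OpenCV's resize takes the target size as (width, height)):

import cv2

# Downsize every image to the smallest size in the dataset (210 x 60 here),
# then flatten it into the fixed-length vector a Dense input layer expects.
img = cv2.imread("car.jpg")                                      # (H, W, 3)
img = cv2.resize(img, (60, 210), interpolation=cv2.INTER_AREA)   # INTER_AREA suits shrinking
x = img.reshape(-1) / 255.0                                      # length 210 * 60 * 3 = 37800

input_dim would then be 37800 for every image.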

Increasing accuracy by changing batch-size and input image size

I am extracting a road network from satellite imagery. The pixel classification is binary (0 = non-road, 1 = road). The mask of the complete satellite image, which is 6400 x 6400 pixels, shows one large road network in which each road is connected to another. For my U-Net implementation I divided that large image into 625 images of 256 x 256 pixels.
My question is: can a neural network find structure more easily with an increased batch size (that is, can it find structure across the samples in a batch), or can it only find structure if the input image size is enlarged?
If your model is a regular convolutional network (without any weird hacks), the samples in a batch will not be connected to each other.
Depending on which loss function you use, the batch size might be important too. The standard losses ('mse', 'binary_crossentropy', 'categorical_crossentropy', etc.) all keep the samples independent from each other, but some losses consider the entire batch (an F1 metric, for instance). If you're using a loss function that doesn't treat samples independently, then the batch size is very important.
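To make the distinction concrete, here is a toy batch-level soft-F1 loss (my own sketch, not something from the answer); because it sums over the whole batch, the gradient for one sample depends on the other samples in it:

import tensorflow as tf

def soft_f1_loss(y_true, y_pred):
    # Soft true/false positive/negative counts over the entire batch.
    tp = tf.reduce_sum(y_true * y_pred)
    fp = tf.reduce_sum((1.0 - y_true) * y_pred)
    fn = tf.reduce_sum(y_true * (1.0 - y_pred))
    f1 = 2.0 * tp / (2.0 * tp + fp + fn + 1e-7)
    return 1.0 - f1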
That said, having a bigger batch size may help the net to find its way more easily, since one image might push weights towards one direction, while another may want a different direction. The mean results of all images in the batch should then be more representative of a general weight update.
Now, entering an experimenting field (we never know everything about neural networks until we test them), consider this comparison:
a batch with 1 huge image versus
a batch of patches of the same image
Both will have the same amount of data, and for a convolutional network it wouldn't make a drastic difference. But in the first case, the net will probably be better at finding connections between roads and may find more segments where the road is covered by something, while the small patches, being full of borders, might look more at textures and be worse at identifying those gaps.
All of this is, of course, a guess. Testing is the best.
My net in a GPU cannot really use big patches, which is bad for me...

Extracting features from caffe last layer

I am trying to extract features from the last layer of a GoogleNet caffe model fine-tuned on car classification. Here's the deploy.prototxt. I tried a couple of things:
I took features from 'loss3_classifier_model' layer which is incorrect.
Now I am extracting features from 'pool5' layer in the model as given in prototxt.
I am not sure whether it's correct, because the features I am extracting for different cars don't seem to differ much. In other words, I am unable to differentiate cars using these last-layer features. I used Euclidean distance on the features (is that correct?). I am not using softmax as I don't want to classify them; I just want the features, and I am checking them using Euclidean distance.
These are the steps I followed:
## load the model
net = caffe.Net('deploy.prototxt',
                caffe.TEST,
                weights='googlenet_finetune_web_car_iter_10000.caffemodel')
# reshape the input blob, since I have only one image in my batch
net.blobs["data"].reshape(1, 3, 224, 224)
# I read my image of size (x,y,3)
frame = cv2.imread(frame_path)
bbox = frame[int(x1):int(x2), int(y1):int(y2)] # crop out the car; I have stored x1, x2, y1, y2 separately.
# resized my image to 224,224,3, network input size.
bbox = cv2.resize(bbox, (224, 224))
# convert from (224, 224, 3) HWC to the (3, 224, 224) CHW layout the model expects
bbox_input = bbox.transpose(2, 0, 1)
# fed input image to the model.
net.blobs['data'].data[0] = bbox_input
net.forward()
# features from pool5 layer or the last layer.
temp = net.blobs["pool5"].data[0]
Now, I want to confirm whether these steps are correct. I am new to caffe and I am not sure about the steps I wrote above.
Both options are valid. The farther you are from the end of your network, the less specialized the features will be to your problem/training set, while still capturing relevant information that may be applied to similar tasks. As you move to the end of the network, the features will be more tuned to your task.
Note that you are dealing with two similar problems/tasks. The network was fine-tuned for car classification ("which model is this car?") and now you want to verify if two cars belong to the same model.
Considering the network was fine-tuned with a large and representative training set, the features obtained from it are powerful, with a lot of representational capability (i.e., they capture many of the complex underlying patterns of the task they were trained on), and they are useful for your verification task.
With this in mind, you could try many ways of comparing two feature vectors:
Euclidean Distance is too simple. I would try it only because it is easy/fast to implement;
Cosine Similarity [1] might also be a simple, but good starting point;
Classifier. Another possibility, which I have used in a similar problem, is to train a classifier (SVM, logistic regression) on top of a combination of the two feature vectors. The input of your classifier could be the concatenation of them side by side.
Incorporate the verification task to your network. You could alter the GoogleNet architecture to receive two photos of cars and output if they belong or not to the same model. You would be transforming/fine-tuning your network from the classification problem to a verification task. Check for siamese networks [2].
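For the two simplest options above, a minimal sketch (the vectors are random stand-ins for two flattened pool5 blobs, which are 1024-dimensional in GoogleNet):

import numpy as np

def cosine_similarity(a, b):
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

feat_a = np.random.rand(1024)  # stand-in for net.blobs["pool5"].data[0].ravel()
feat_b = np.random.rand(1024)
print(np.linalg.norm(feat_a - feat_b))    # Euclidean distance: lower = more similar
print(cosine_similarity(feat_a, feat_b))  # cosine similarity: higher = more similar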
Edit: there is a mistake when you resize your frame that may be the cause of your problems!
# I read my image of size (x,y,3)
frame = cv2.imread(frame_path)
# resized my image to 224,224,3, network input size.
bbox = cv2.resize(frame, (224, 224))
You should have passed bbox, the cropped car, to cv2.resize() instead of the full frame. You are probably feeding a garbage input to the network, and that is why the output always ends up looking similar.

How to visualize features classified by tensorflow?

I'm running the default classify_image code of the imagenet model. Is there any way to visualize the features that it has extracted? If I use 'pool_3:0', that gives me the feature vector. Is there any way to overlay this on top of my image to see which features it has picked as important?
Ross Girshick described one way to visualize what a pooling layer has learned: https://www.cs.berkeley.edu/~rbg/papers/r-cnn-cvpr.pdf
Essentially, instead of visualizing features directly, you find a few images that a neuron fires most on. You repeat that for a few (or all) neurons from your feature vector. The algorithm needs lots of images to choose from, of course, e.g. the test set.
I wrote my implementation of this idea for cifar10 model in Tensorflow today, which I want to share (uses OpenCV): https://gist.github.com/kukuruza/bb640cebefcc550f357c
You can use it if you manage to provide the images tensor for reading images in batches, and the pool_3:0 tensor.
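In essence, the selection step boils down to something like this (a sketch with a hypothetical feature matrix; pool_3:0 in the Inception model is 2048-dimensional):

import numpy as np

features = np.random.rand(10000, 2048)  # stand-in: pool_3 vectors for 10000 test images
k = 9
top_images = np.argsort(-features, axis=0)[:k]  # shape (k, 2048)
# top_images[:, n] holds the indices of the k images that fire neuron n hardest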
