I am working on a problem where I have to classify images into different groups. I am a beginner working with Keras and a simple Sequential model. How should I handle images with different dimensions in the code below? For example, some images are 210x158x3 while others are 210x60x3. Please suggest.
model.add(Dense(100, input_dim=?, activation="sigmoid"))
model.add(Dense(100, input_dim=?, activation="sigmoid"))
For simple classification algorithms in Keras, the input must always have the same size. A feedforward neural network consists of an input layer, one or more hidden layers and an output layer, with all nodes connected, so every input node must receive a value. Moreover, the shape and other hyperparameters of the network are static: you can't change the number of inputs, and therefore the size of each image, within one network.
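To make the fixed-size constraint concrete, here is a minimal NumPy sketch (not Keras itself) of what a Dense layer computes before its activation: the flattened input times a weight matrix whose number of rows is fixed when the layer is built. Small stand-in images (21x16x3 and 21x6x3) are used instead of the real sizes to keep it light; `dense_forward` is a hypothetical helper.

```python
import numpy as np

def dense_forward(x, weights, bias):
    """What a Dense layer computes before its activation: x @ W + b."""
    return x @ weights + bias

rng = np.random.default_rng(0)

# Two small stand-in images with different widths, flattened as Dense expects.
img_a = rng.random((21, 16, 3)).reshape(-1)  # 1008 values
img_b = rng.random((21, 6, 3)).reshape(-1)   # 378 values

# The weight matrix is built once, for exactly one input length (input_dim).
input_dim = img_a.size
weights = rng.random((input_dim, 100))
bias = np.zeros(100)

out_a = dense_forward(img_a, weights, bias)  # shape (100,)

try:
    dense_forward(img_b, weights, bias)  # 378 inputs vs. 1008 rows of weights
    shape_mismatch = False
except ValueError:
    shape_mismatch = True
```

The second call fails because there is no way to multiply 378 inputs with a weight matrix that expects 1008, which is why every image has to be brought to one size first.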
The best practice for your case would be to either downsize all images to the size of your smallest image, or upsize all images to the size of your largest image.
Downsizing
With downsizing you actively delete pixels from your image, including the information they contain. This can cost accuracy when the removed detail matters for the classification, but it also decreases computational time.
Upsizing
With upsizing you add pixels to your image without adding information. This increases computational time, but you keep all the information of each image.
As a starting point, I would suggest downsizing your images to the size of the smallest one. This is a common practice in science as well [1]. One library that can do this is OpenCV; for implementation details, please refer to these Stack Overflow questions:
Python - resize image
Opencv-Python-Resizing image
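A minimal NumPy sketch of "downsize everything to the smallest size" follows. It uses nearest-neighbor sampling to stay dependency-free; with OpenCV you would typically call `cv2.resize(img, (width, height), interpolation=cv2.INTER_AREA)` instead (OpenCV takes the target size as width, height). The helper names are hypothetical, and taking the minimum height and minimum width separately may change aspect ratios.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize of an (H, W, C) array to (out_h, out_w, C)."""
    in_h, in_w = img.shape[:2]
    # Map each output row/column back to a source row/column.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows[:, None], cols[None, :]]

def downsize_to_smallest(images):
    """Resize every image to the smallest height and smallest width in the list."""
    target_h = min(im.shape[0] for im in images)
    target_w = min(im.shape[1] for im in images)
    return np.stack([resize_nearest(im, target_h, target_w) for im in images])

rng = np.random.default_rng(0)
images = [rng.random((210, 158, 3)), rng.random((210, 60, 3))]
batch = downsize_to_smallest(images)       # shape (2, 210, 60, 3)
flat = batch.reshape(len(batch), -1)       # one row of 210*60*3 values per image
```

After this, every row of `flat` has the same length, which is the value you would pass as `input_dim` to the first Dense layer.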
Related
If I'm working with a dataset where I have ~100,000 training images and ~20,000 validation images, each of size 32 x 32 x 3, how do the size and dimensions of my dataset affect the number of Conv2d layers in my CNN? My intuition is to use fewer Conv2d layers (2-3), because any more than 3 layers will be working with parts of the image that are too small to gain relevant data from.
In addition, does it make sense to have layers with a large number of filters, >128? My thought is that when dealing with small images, it doesn't make sense to have a large number of parameters.
Since your images have exactly the same input size as those in CIFAR-10 and CIFAR-100, just have a look at what people have tried on those datasets.
In general, you can start with something like ResNet-18. Also, I don't quite understand why you say
because any more than 3 layers will be working with parts of the image that are too small to gain relevant data from.
That only happens if you downsample, e.g. with max pooling or a convolution with padding 1 and stride 2. Otherwise, with size-preserving padding (e.g. padding 1 for a 3x3 kernel), the spatial size of 32x32 stays the same and only the number of channels changes, depending on the network.
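The size arithmetic can be checked with the standard formula for the output size of a convolution or pooling layer along one axis, out = floor((n + 2p - k) / s) + 1. The small sketch below (helper names are hypothetical) shows that a 3x3 kernel with padding 1 keeps 32x32 unchanged, that stride 2 halves it, and that a 32x32 input allows only five halvings before reaching 1x1. It also counts a conv layer's weights, which depend on kernel size and channel counts but not on the 32x32 spatial size.

```python
def conv_output_size(n, kernel, padding, stride):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

def conv_params(kernel, in_channels, out_channels):
    """Learned weights of a 2D conv layer: one k x k x C_in kernel per filter, plus biases."""
    return kernel * kernel * in_channels * out_channels + out_channels

# 3x3 conv, padding 1, stride 1: 32x32 stays 32x32, however many such layers you stack.
same = conv_output_size(32, kernel=3, padding=1, stride=1)
# 3x3 conv, padding 1, stride 2: halves the size.
halved = conv_output_size(32, kernel=3, padding=1, stride=2)
# 2x2 max pooling with stride 2: also halves the size.
pooled = conv_output_size(32, kernel=2, padding=0, stride=2)

# Count how many stride-2 steps fit before a 32x32 map shrinks to 1x1.
size, steps = 32, 0
while size > 1:
    size = conv_output_size(size, kernel=3, padding=1, stride=2)
    steps += 1
# 32 -> 16 -> 8 -> 4 -> 2 -> 1, i.e. steps == 5

# A 3x3 layer going from 128 to 256 channels has the same weight count at any image size.
weights_128_to_256 = conv_params(3, 128, 256)
```

So the depth of the network is not limited by the 32x32 input as such; only the number of downsampling steps is.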
Designing networks almost always starts with looking at what other people did and what worked for them, and going from there. You almost never want to do it from scratch on your own, since the iteration cycles are just too long, and the researchers at Google, Facebook, etc. who released these models had far more resources than you will ever have to find something good.
I am trying to train my model, which classifies images.
The problem I have is that they have different sizes. How should I format my images or my model architecture?
You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
If you are using fully connected units though, you're up for trouble: Here you have a fixed number of learned weights your network has to work with, so varying inputs would require a varying number of weights - and that's not possible.
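By contrast, the learned weights of a convolutional layer are just the kernel, which slides over whatever input it gets. The single-channel NumPy sketch below (the loop-based `conv2d_valid` is a hypothetical illustration, not how frameworks implement convolution) applies the same 3x3 kernel to a 32x32 and a 48x64 input. One common way to then get a fixed-size result from such a network is global average pooling, i.e. averaging each output map to a single number.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Single-channel 'valid' 2D cross-correlation, as computed by a CNN conv layer."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.ones((3, 3)) / 9.0  # the only learned weights: 9 numbers

small = conv2d_valid(np.ones((32, 32)), kernel)  # shape (30, 30)
large = conv2d_valid(np.ones((48, 64)), kernel)  # shape (46, 62)

# Global average pooling: one value per map, whatever the input size was.
gap_small, gap_large = small.mean(), large.mean()
```

The same 9 weights work for both inputs; only the output map size changes, which is why a fully convolutional network does not care about the input size while a fully connected layer does.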
If that is your problem, here are some things you can do:
Don't worry about squashing the images. A network might learn to make sense of the content anyway; do scale and perspective really mean anything to the content?
Center-crop the images to a specific size. If you fear you're losing data, do multiple crops and use these to augment your input data, so that the original image will be split into N different images of correct size.
Pad the images with a solid color to a square size, then resize.
Do a combination of that.
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
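The crop and pad options can be sketched in NumPy as follows. The function names are hypothetical, images are assumed to be (H, W, C) arrays, and after padding to a square you would still resize to the network's input size (e.g. with OpenCV or TensorFlow).

```python
import numpy as np

def center_crop(img, size):
    """Crop the central size x size region of an (H, W, C) image (needs H, W >= size)."""
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]

def pad_to_square(img, value=0):
    """Pad an (H, W, C) image with a solid color so that H == W, keeping it centered."""
    h, w = img.shape[:2]
    side = max(h, w)
    pad_h, pad_w = side - h, side - w
    return np.pad(
        img,
        ((pad_h // 2, pad_h - pad_h // 2), (pad_w // 2, pad_w - pad_w // 2), (0, 0)),
        mode="constant",
        constant_values=value,
    )

def random_crops(img, size, n, rng):
    """n random size x size crops, turning one image into several training samples."""
    h, w = img.shape[:2]
    crops = []
    for _ in range(n):
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        crops.append(img[top:top + size, left:left + size])
    return np.stack(crops)

img = np.ones((210, 60, 3))
square = pad_to_square(img)   # (210, 210, 3), 75 padded columns on each side
crop = center_crop(img, 60)   # (60, 60, 3)
crops = random_crops(img, 60, n=4, rng=np.random.default_rng(0))  # (4, 60, 60, 3)
```

The zero-valued border in `square` is exactly the kind of padding the network might become biased towards.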
If you need some ideas, have a look at the Images section of the TensorFlow documentation; there are functions like resize_image_with_crop_or_pad that take away the bigger work.
As for not caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
    distorted_image,
    lambda x, method: tf.image.resize_images(x, [height, width], method=method),
    num_cases=num_resize_cases)
They're totally aware of it and do it anyway.
Depending on how far you want or need to go, there is actually a paper called Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition that handles inputs of arbitrary sizes: it pools the last convolutional feature map over a fixed pyramid of grids, so the fully connected layers always receive a fixed-length vector.
Try adding a spatial pyramid pooling layer and put it after your last convolutional layer, so that the FC layers always get fixed-length vectors as input. During training, train on the images from the entire dataset using one particular image size for one epoch. Then, for the next epoch, switch to a different image size and continue training.
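A NumPy sketch of the pooling step, in the spirit of the paper: the feature map is max-pooled over 1x1, 2x2 and 4x4 grids and the results are concatenated. The pyramid levels here are chosen for illustration and are not claimed to be the paper's exact configuration; `spatial_pyramid_pool` is a hypothetical helper.

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool an (H, W, C) feature map over n x n grids for each n in `levels`
    and concatenate the results into one fixed-length vector.

    Output length is C * sum(n * n for n in levels), independent of H and W.
    Assumes H and W are at least max(levels) so that every bin is non-empty.
    """
    h, w, c = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges chosen so that the n x n bins cover the whole map.
        row_edges = np.linspace(0, h, n + 1).astype(int)
        col_edges = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[row_edges[i]:row_edges[i + 1],
                                   col_edges[j]:col_edges[j + 1]]
                pooled.append(cell.max(axis=(0, 1)))  # one max per channel
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
fm_small = rng.random((13, 13, 256))  # feature map from a smaller input image
fm_large = rng.random((20, 31, 256))  # feature map from a larger, non-square image
v_small = spatial_pyramid_pool(fm_small)  # length 256 * (1 + 4 + 16) = 5376
v_large = spatial_pyramid_pool(fm_large)  # same length
```

Because the vector length only depends on the channel count and the pyramid levels, the FC layers after it can keep fixed weights while the input image size changes from epoch to epoch.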
How to give superpixels as input to CNN? I used SLIC algorithm to segment the images into superpixels.
How can I use this for classification using CNN?
I will try to help you. CNNs (Convolutional Neural Networks) take individual input images, not a matrix of superpixel labels. So you need to extract each superpixel and make it its own image. In other words, if you segment your image into 300 superpixels, you then need to create 300 new images, one for each superpixel.
After this, each new image will probably have a different size. You can't work like that, because the number of input neurons in the CNN can't change. To handle this, you can center each new image on an N x N background (N must be large enough to cover all new images). With the superpixels centered, each pixel of the N x N image becomes an input of your CNN. In other words:
1) Each centered superpixel is fed to the CNN one at a time;
2) The number of inputs to the CNN is X*Y, where X is shape[0] and Y is shape[1] of the centered superpixel image;
3) With 300 centered superpixels, your CNN must compute the output for each one.
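The steps above can be sketched in NumPy, assuming you already have an integer label map of the same height and width as the image (for example the one returned by scikit-image's `skimage.segmentation.slic`). `superpixels_to_patches` is a hypothetical helper; pixels of other superpixels inside a bounding box are set to zero.

```python
import numpy as np

def superpixels_to_patches(image, labels, n):
    """Cut each superpixel out of `image` and center it on an n x n black canvas.

    image:  (H, W, C) array
    labels: (H, W) integer label map, one label per superpixel
    n:      canvas side; must be >= every superpixel's bounding-box height and width
    Returns an array of shape (num_superpixels, n, n, C).
    """
    patches = []
    for label in np.unique(labels):
        mask = labels == label
        rows, cols = np.nonzero(mask)
        top, bottom = rows.min(), rows.max() + 1
        left, right = cols.min(), cols.max() + 1
        # Keep only this superpixel's pixels inside its bounding box; zero the rest.
        crop = image[top:bottom, left:right] * mask[top:bottom, left:right, None]
        h, w = crop.shape[:2]
        if h > n or w > n:
            raise ValueError(f"superpixel {label} ({h}x{w}) does not fit in {n}x{n}")
        canvas = np.zeros((n, n, image.shape[2]), dtype=image.dtype)
        y0, x0 = (n - h) // 2, (n - w) // 2
        canvas[y0:y0 + h, x0:x0 + w] = crop
        patches.append(canvas)
    return np.stack(patches)

# Toy example: a 4x6 RGB image split into two rectangular "superpixels".
image = np.arange(4 * 6 * 3).reshape(4, 6, 3)
labels = np.zeros((4, 6), dtype=int)
labels[:, 3:] = 1
patches = superpixels_to_patches(image, labels, n=8)  # shape (2, 8, 8, 3)
```

Each element of `patches` has the same shape, so it can be fed to the CNN one at a time (or as a batch).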
Illustration: https://imgur.com/k8pRDw7
Look at the illustration, and good luck! :)
I am extracting a road network from satellite imagery. The pixel classification is binary (0 = non-road, 1 = road). Hence, the mask of the complete satellite image, which is 6400 x 6400 pixels, shows one large road network where each road is connected to another road. For the implementation of the U-Net, I divided that large image into 625 images of 256 x 256 pixels.
My question is: can a neural network find structure more easily with a larger batch size (i.e., can it find structure across the different images in a batch), or can it only find structure if the input image size is enlarged?
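The tiling described here works out because 6400 / 256 = 25 tiles per side, and 25 * 25 = 625 tiles. A NumPy sketch of that split (the helper name `tile_image` is hypothetical) that also confirms the patches hold exactly the same number of pixels as the full mask:

```python
import numpy as np

def tile_image(img, tile):
    """Split an (H, W) or (H, W, C) array into non-overlapping tile x tile patches.

    Assumes H and W are exact multiples of `tile`. Patches are ordered row by row.
    """
    h, w = img.shape[:2]
    rest = img.shape[2:]
    if h % tile or w % tile:
        raise ValueError("image size must be a multiple of the tile size")
    grid = img.reshape(h // tile, tile, w // tile, tile, *rest)
    # Bring the two grid axes together, then flatten them into one batch axis.
    grid = grid.swapaxes(1, 2)
    return grid.reshape(-1, tile, tile, *rest)

mask = np.zeros((6400, 6400), dtype=np.uint8)  # stand-in for the road mask
patches = tile_image(mask, 256)                # shape (625, 256, 256)
same_pixel_count = patches.size == mask.size   # 625 * 256 * 256 == 6400 * 6400
```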
If your model is a regular convolutional network (without any weird hacks), the samples in a batch will not be connected to each other.
Depending on which loss function you use, the batch size might be important too. The regular built-in losses ('mse', 'binary_crossentropy', 'categorical_crossentropy', etc.) all keep the samples independent of each other. But some losses consider the entire batch (F1-based losses, for instance). If you're using a loss function that doesn't treat samples independently, then the batch size is very important.
That said, a bigger batch size may help the net find its way more easily, since one image might push the weights in one direction while another pushes them in a different direction. The mean update over all images in the batch should then be more representative of a general weight update.
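The difference can be seen with a tiny NumPy example on two 4-pixel masks. Per-sample binary cross-entropy gives the same mean whether you evaluate one batch of two or two batches of one, while a soft F1 (Dice) score computed over the whole batch is not the mean of the per-sample scores. The soft F1 formula here is one common hand-written variant, not a built-in Keras loss; the function names are hypothetical.

```python
import numpy as np

def bce_per_sample(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy per sample (mean over pixels); samples stay independent."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return losses.reshape(len(y_true), -1).mean(axis=1)

def soft_f1_batch(y_true, y_pred, eps=1e-7):
    """Soft F1 (Dice) computed over all pixels of the whole batch at once."""
    tp = np.sum(y_true * y_pred)
    fp = np.sum((1 - y_true) * y_pred)
    fn = np.sum(y_true * (1 - y_pred))
    return 2 * tp / (2 * tp + fp + fn + eps)

y_true = np.array([[1, 1, 0, 0], [0, 0, 0, 1]], dtype=float)
y_pred = np.array([[0.9, 0.8, 0.1, 0.2], [0.1, 0.1, 0.1, 0.3]])

# BCE: one batch of two equals the average of two batches of one.
full_bce = bce_per_sample(y_true, y_pred).mean()
split_bce = np.mean([bce_per_sample(y_true[i:i + 1], y_pred[i:i + 1]).mean() for i in range(2)])

# Soft F1: the batch-level score depends on which samples share the batch.
batch_f1 = soft_f1_batch(y_true, y_pred)  # about 0.714
mean_sample_f1 = np.mean([soft_f1_batch(y_true[i:i + 1], y_pred[i:i + 1]) for i in range(2)])  # about 0.6125
```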
Now, entering experimental territory (we never know everything about neural networks until we test them), consider this comparison:
a batch with 1 huge image versus
a batch of patches of the same image
Both contain the same amount of data, and for a convolutional network it wouldn't make a drastic difference. But in the first case, the net will probably be better at finding connections between roads, and may find more segments where a road is covered by something, while the small patches, being full of borders, might focus more on textures and be worse at identifying these gaps.
All of this is, of course, a guess. Testing is the best.
On my GPU, my net can't really use big patches, which is bad for me...
I'm running the default classify_image code of the imagenet model. Is there any way to visualize the features that it has extracted? If I use 'pool_3:0', that gives me the feature vector. Is there any way to overlay this on top of my image to see which features it has picked as important?
Ross Girshick described one way to visualize what a pooling layer has learned: https://www.cs.berkeley.edu/~rbg/papers/r-cnn-cvpr.pdf
Essentially, instead of visualizing features directly, you find the few images on which a neuron fires most strongly. You repeat that for a few or all neurons of your feature vector. Of course, the algorithm needs lots of images to choose from, e.g. the test set.
I wrote an implementation of this idea for the cifar10 model in TensorFlow today, which I want to share (it uses OpenCV): https://gist.github.com/kukuruza/bb640cebefcc550f357c
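The selection step itself is simple once the activations are collected. The NumPy sketch below assumes you have already run the network over many images and stacked their feature vectors (e.g. `pool_3:0` reshaped to a flat vector) into a matrix with one row per image; that collection step is not shown, and `top_activating_images` is a hypothetical helper.

```python
import numpy as np

def top_activating_images(activations, k):
    """For each neuron, indices of the k images on which it fires most strongly.

    activations: (num_images, num_neurons) array, one feature vector per image.
    Returns an int array of shape (num_neurons, k), strongest first.
    """
    # Sort image indices by activation (ascending) per neuron, reverse, keep the top k.
    order = np.argsort(activations, axis=0)[::-1][:k]
    return order.T

# Toy example: 3 images, 2 neurons.
activations = np.array([
    [0.1, 0.9],
    [0.7, 0.2],
    [0.4, 0.5],
])
top = top_activating_images(activations, k=2)  # neuron 0 -> images 1, 2; neuron 1 -> images 0, 2
```

You would then display the images at those indices for each neuron to see what it responds to.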
You could use it if you manage to provide the images tensor for reading images by batches, and the pool_3:0 tensor.