I want to extract image features in order to compute similarity between images. Features can be obtained easily from the pre-trained VGG19 model in TensorFlow, but VGG19 has many layers and I don't know which layer I should use to get the features. Which layer's output is appropriate for this problem?
# I think this is the correct way to extract features
model = tf.keras.applications.VGG19(include_top=True,
                                    weights='imagenet')
inputs = model.input
outputs = model.layers[-2].output  # fc2, the second-to-last layer
extract_model = tf.keras.Model(inputs, outputs)
My intuition is that the closer a layer is to the final output, the more powerful its features should be. But some tutorials say to use include_top=False to extract features (e.g. Image Captioning with Attention in TensorFlow).
So I don't know which layer I should use. Please help me out here in this thread.
The include_top=False may be used because the last 3 layers (for that specific model) are fully connected layers which are not typically good feature vectors. If the model directly outputs a feature vector, then you don't need it.
Most people use the last layer for transfer learning, but it may depend on your application. For example, Gatys et al. show that the first few layers of VGG are sensitive to the style of the image while later layers are sensitive to the content.
I would probably try all of them in a hyperparameter search and see which gives the best performance. If by image similarity you mean the similarity of objects contained inside, I would probably start with the last layer.
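For reference, here is a minimal sketch of the two routes discussed above (fc2 activations from the full model versus globally pooled convolutional features from include_top=False); the dummy batch is only there to make the snippet self-contained:
import numpy as np
import tensorflow as tf

# Dummy batch standing in for your real images (RGB, 0-255, 224x224).
images = np.random.uniform(0, 255, (2, 224, 224, 3)).astype("float32")
images = tf.keras.applications.vgg19.preprocess_input(images)

# Route 1: 4096-d activations of the second-to-last (fully connected) layer.
full = tf.keras.applications.VGG19(include_top=True, weights='imagenet')
fc_model = tf.keras.Model(full.input, full.layers[-2].output)
fc_features = fc_model.predict(images)        # shape (batch, 4096)

# Route 2: include_top=False with global average pooling, 512-d per image.
conv = tf.keras.applications.VGG19(include_top=False, weights='imagenet', pooling='avg')
conv_features = conv.predict(images)          # shape (batch, 512)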
Related
I have a pre-trained image classification model saved in caffe; the model expects grayscale (one-channel) images. I want to use this model in a tool that only provides RGB (three-channel) input to the model. It is not possible to change the way this tool provides images, so I thought of adding a layer before the input layer that converts the input to a single channel. Is that possible in caffe, and how?
I'm looking for a solution that doesn't require defining new layers for caffe, if possible.
Note that I have the ".prototxt" and the ".weights" files of the model.
I previously did a similar thing in tensorflow but I don't know if this is possible in caffe and didn't find much material online.
You can add a Python layer to do it for you.
What is a Python layer?
An example of such a layer can be found here.
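As a rough illustration (this is a sketch, not the linked example), a Python layer that collapses a 3-channel blob into one channel could look something like the code below; the class name is a placeholder, and you would reference it from your prototxt with a layer of type "Python" whose python_param points at the module and class.
import caffe

class RGBToGrayLayer(caffe.Layer):
    """Hypothetical Python layer: converts a 3-channel bottom blob to 1 channel."""

    def setup(self, bottom, top):
        if len(bottom) != 1:
            raise Exception("Expected exactly one bottom blob (the color image).")

    def reshape(self, bottom, top):
        n, c, h, w = bottom[0].data.shape
        top[0].reshape(n, 1, h, w)

    def forward(self, bottom, top):
        # Luminance-weighted sum; assumes BGR channel order (the caffe/OpenCV convention).
        data = bottom[0].data
        top[0].data[:, 0] = 0.114 * data[:, 0] + 0.587 * data[:, 1] + 0.299 * data[:, 2]

    def backward(self, top, propagate_down, bottom):
        pass  # nothing to learn, and no gradient is needed for inference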
I used Keras to build a Siamese network following the coding format of one of the questions posted (please see the code sample here). To explain briefly, I built a Siamese network using a pretrained EfficientNet so that each copy of the network produces a dense embedding, and the two embeddings are then combined into an L1-similarity output.
However, at prediction time I only want to obtain the dense output of one of the branches (as an embedding). I plan on using a variety of unsupervised learning methods (including KNN) on these outputs.
During prediction, how can I ask Keras to run only one copy of my network graph with a single input? Can I extract only a part of the NN graph? I don't want to have to always generate pairs of images, or pay the cost of running two images through the network when I only need one output.
Let me just make sure that I understand your question and context. You are using a Siamese network (efficient net) and you want to generate embeddings for your input images.
In other words, you only want to save the image encodings from one of the two ConvNets?
If that is the case, I don't really see the point of building a Siamese network at all; just go for a single ConvNet (using EfficientNet), because the Siamese model will always ask you for image pairs.
If you go for a single ConvNet model and you identify the layer you want to use for the embeddings, then you can use tf.keras.backend.function like this:
get_layer_output = tf.keras.backend.function(
    [fine_tuned_model.layers[0].input],
    [fine_tuned_model.layers[-2].output])
Then, at prediction time, you can call it like this:
features = get_layer_output([x])[0]
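If you would rather keep the Siamese model you already trained, another option is to pull out the shared branch and call it on single images. This is a minimal sketch, assuming the branch was built as its own Keras sub-model and given the hypothetical name "embedding_branch" (the file path is a placeholder too):
import tensorflow as tf

# Placeholder batch standing in for your real images.
single_image_batch = tf.random.uniform((4, 224, 224, 3))

# Load your trained Siamese model (add custom_objects if you used a custom L1 layer).
siamese_model = tf.keras.models.load_model("siamese_model.h5")

# Grab the shared branch, which maps a single image to its dense embedding.
embedding_branch = siamese_model.get_layer("embedding_branch")

# Embed single images without ever forming pairs.
embeddings = embedding_branch.predict(single_image_batch)  # (batch, embedding_dim)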
I want to use TensorFlow Hub to generate features for my images, but it seems that the 2048 features of the Inception module are not enough for my problem because my class images are very similar. So I decided to use the features of a hidden layer of this module, for example:
"module/InceptionV3/InceptionV3/Mixed_7c/concat:0"
How can I write a function that gives me these ?×8×8×2048 features for my input images?
Please try:
module = hub.Module(...)  # As before.
outputs = module(dict(images=images),
                 signature="image_feature_vector",
                 as_dict=True)
print(outputs.items())
Besides the default output with the final feature vector, you should see a bunch of intermediate feature maps under keys starting with InceptionV3/ (or whichever other architecture you select). These are 4D tensors with shape [batch_size, feature_map_height, feature_map_width, num_features], so you might want to remove those middle dimensions by avg- or max-pooling over them before feeding the result into a classifier.
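Putting that together, a rough sketch could look like the following; the module handle and the exact output key are assumptions here, so print outputs.keys() on your module to find the right name:
import tensorflow as tf
import tensorflow_hub as hub

images = tf.placeholder(tf.float32, shape=[None, 299, 299, 3])  # example input batch

# Example Inception V3 feature-vector module; substitute your own handle.
module = hub.Module("https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1")
outputs = module(dict(images=images),
                 signature="image_feature_vector",
                 as_dict=True)

# Hypothetical key for the Mixed_7c feature map; check outputs.keys() for the exact name.
mixed_7c = outputs["InceptionV3/Mixed_7c"]      # shape [batch, 8, 8, 2048]

# Average-pool away the 8x8 spatial grid to get one 2048-d vector per image.
pooled = tf.reduce_mean(mixed_7c, axis=[1, 2])  # shape [batch, 2048]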
I am trying to extract features from the last layer of a GoogleNet caffe model fine-tuned on car classification. Here's the deploy.prototxt. I tried a couple of things:
I took features from 'loss3_classifier_model' layer which is incorrect.
Now I am extracting features from 'pool5' layer in the model as given in prototxt.
I am not sure whether it's correct or not, because the features I extract for different cars don't seem to differ much. In other words, I am unable to differentiate cars using these last-layer features. I used Euclidean distance on the features (is that correct?). I am not using softmax because I don't want to classify the cars; I just want the features, and I am checking them using Euclidean distance.
These are the steps I followed:
## load the model
net = caffe.Net('deploy.prototxt',
caffe.TEST,
weights ='googlenet_finetune_web_car_iter_10000.caffemodel')
# resize the input size as I have only one image in my batch.
net.blobs["data"].reshape(1, 3, 224, 224)
# I read my image of size (x,y,3)
frame = cv2.imread(frame_path)
bbox = frame[int(x1):int(x2), int(y1):int(y2)] # crop out the car; x1, x2, y1, y2 are stored separately.
# resized my image to 224,224,3, network input size.
bbox = cv2.resize(bbox, (224, 224))
# to align my input to the input of the model
bbox_input = bbox.swapaxes(1,2).reshape(3,224,224)
# fed input image to the model.
net.blobs['data'].data[0] = bbox_input
net.forward()
# features from pool5 layer or the last layer.
temp = net.blobs["pool5"].data[0]
Now, I want to confirm whether these steps are correct. I am new to caffe and I am not sure about the steps I wrote above.
Both options are valid. The farther you are from the end of your network, the less specialized the features will be to your problem/training set, while still capturing relevant information that may be applied to similar tasks. As you move to the end of the network, the features will be more tuned to your task.
Note that you are dealing with two similar problems/tasks. The network was fine-tuned for car classification ("which model is this car?") and now you want to verify if two cars belong to the same model.
Considering the network was fine-tuned with a large and representative training set, the features obtained from it are powerful, with a lot of representational capacity (i.e., they capture many of the complex underlying patterns of the task they were trained for), and should be useful for your verification task.
With this in mind, you could try several ways of comparing two feature vectors (a quick sketch of the first two options follows this list):
Euclidean Distance is too simple. I would try it only because it is easy/fast to implement;
Cosine Similarity [1] is also a simple but good starting point;
Classifier. Another possibility, which I have used for a similar problem, is to train a classifier (SVM, Logistic Regression) on top of a combination of the two feature vectors. The input to your classifier could be the two vectors concatenated side by side.
Incorporate the verification task into your network. You could alter the GoogleNet architecture to receive two photos of cars and output whether or not they belong to the same model. You would be transforming/fine-tuning your network from the classification problem to a verification task. Check out siamese networks [2].
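Here is the promised sketch of the first two options; random vectors stand in for the real pool5 activations, which in your case would be net.blobs["pool5"].data[0].flatten() for each car crop:
import numpy as np

# Stand-ins for the pool5 features of two car crops (GoogLeNet's pool5 is 1024-d).
feat_a = np.random.rand(1024).astype(np.float32)
feat_b = np.random.rand(1024).astype(np.float32)

euclidean = np.linalg.norm(feat_a - feat_b)
cosine = np.dot(feat_a, feat_b) / (np.linalg.norm(feat_a) * np.linalg.norm(feat_b))

print("Euclidean distance:", euclidean)
print("Cosine similarity:", cosine)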
Edit: there is a mistake when you resize your frame that may be the cause of your problems!
# I read my image of size (x,y,3)
frame = cv2.imread(frame_path)
# resized my image to 224,224,3, network input size.
bbox = cv2.resize(frame, (224, 224))
Make sure you pass the image array itself (frame, or your cropped bbox) to cv2.resize(); if you resize anything else, you are feeding garbage input to the network, and that would explain why the outputs always look similar.
I'm running the default classify_image code of the imagenet model. Is there any way to visualize the features that it has extracted? If I use 'pool_3:0', that gives me the feature vector. Is there any way to overlay this on top of my image to see which features it has picked as important?
Ross Girshick described one way to visualize what a pooling layer has learned: https://www.cs.berkeley.edu/~rbg/papers/r-cnn-cvpr.pdf
Essentially, instead of visualizing features, you find the few images that each neuron fires on most strongly. You repeat that for a few (or all) neurons in your feature vector. The algorithm needs lots of images to choose from, of course, e.g. the test set.
I wrote my implementation of this idea for the cifar10 model in TensorFlow today, which I want to share (it uses OpenCV): https://gist.github.com/kukuruza/bb640cebefcc550f357c
You could use it, provided you supply the images tensor that reads images in batches and the pool_3:0 tensor.
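Independent of the linked gist, the core of the idea fits in a few lines. This is a rough sketch assuming you have already dumped the pool_3 feature vectors of a set of test images to a NumPy array (the file name is a placeholder):
import numpy as np

# Hypothetical dump of pool_3 features: one 2048-d row per test image.
features = np.load("pool3_features.npy")     # shape (num_images, 2048)
top_k = 5

for neuron in range(features.shape[1]):
    # Indices of the test images that activate this neuron the most.
    top_images = np.argsort(features[:, neuron])[::-1][:top_k]
    print("neuron", neuron, "-> images", top_images)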