Extract image features using Caffe for custom image classifier - python

I would like to obtain the output of the 6th layer of a pre-built caffe model and train an SVM on top of it. My intention is to build a custom image classifier, where the user can create custom image classes, and classify input images among those classes, instead of the imagenet classes.Here is the pseudo code:
#input
file='cat.jpg'
image=caffe.io.load_image(file)
#model
net = caffe.Classifier('deploy.prototxt','model.caffemodel')
#compute activation at layer 6 --- Need help here. Will net.forward help? will the activation be retained?
#extract features from layer 6:
features = net.blobs['fc6'].data[4][:,0, 0]
#SVM
category=svm.predict(features)
print get_category_name(category)

You should use Net class, instead of Classifier. Thus, you just need to call net.forward().
Two things to pay attention to:
Preprocess your input image. See Transformer class here for reference.
If you extract the features by using just
features = net.blobs['fc6'].data
your array will be overwritten by the next forward() call. Be sure you are performing a deep copy, such as
features = net.blobs['fc6'].data.copy()

Related

Looking for layer names for keras inceptionresnetv2

Really don't have much idea of what I'm doing, followed this tutorial to process deepdream images https://www.youtube.com/watch?v=Wkh72OKmcKI
Trying to change the base model data set to any from here, https://keras.io/api/applications/#models-for-image-classification-with-weights-trained-on-imagenet particularly InceptionResNetV2 currently. InceptionV3 uses "mixed0" up to "mixed10" whereas, the former data set uses a different naming system apparently.
Would have to change this section
# Maximize the activations of these layers
names = ['mixed3', 'mixed5']
layers = [base_model.get_layer(name).output for name in names]
# Create the feature extraction model
dream_model = tf.keras.Model(inputs=base_model.input, outputs=layers)
I'm getting an error "no such layer: mixed3"
So yea, just trying to figure out how to get the names for the layers in this data set as well as others
You can simply enter the following code to find out the model architecture(including layer names).
#Model denotes the Inception model
model.summary()
Or to visualize complex relationships,
tf.keras.utils.plot_model(model, to_file='model.png')

Implementing attention in Keras classification

I would like to implement attention to a trained image classification CNN model. For example, there are 30 classes and with the Keras CNN, I obtain for each image the predicted class. However, to visualize the important features/locations of the predicted result. I want to add a Soft Attention after the FC layer. I tried to read the "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" to obtain similar results. However, I could not understand how the author has implemented. Because my problem is not an image caption or text seq2seq problem.
I have an image classification CNN and would like to extract the features and put it into a LSTM to visualize soft attention. Although I am getting stuck every time.
The steps I took:
Load CNN model
Extract features from a single image (however, the LSTM will check the same image with some removed patches in the image)
The steps I took:
Load CNN model (I already trained the CNN earlier for predictions)
Extract features from a single image (however, the LSTM will check the same image with some removed patches in the image)
Getting stuck after steps below:
Create LSTM with soft attention
Obtain a single output
I am using Keras with TensorFlow background. CNN features are extracted using ResNet50. The images are 224x224 and the FC layer has 2048 units as output shape.
#Extract CNN features:
base_model = load_model(weight_file, custom_objects={'custom_mae': custom_mae})
last_conv_layer = base_model.get_layer("global_average_pooling2d_3")
cnn_model = Model(input=base_model.input, output=last_conv_layer.output)
cnn_model.trainable = False
bottleneck_features_train_v2 = cnn_model.predict(train_gen.images)
#Create LSTM:
seq_input = Input(shape=(1, 224, 224, 3 ))
encoded_frame = TimeDistributed(cnn_model)(seq_input)
encoded_vid = LSTM(2048)(encoded_frame)
lstm = Dropout(0.5)(encoded_vid)
#Add soft attention
attention = Dense(1, activation='tanh')(lstm)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)
#output 101 classes
predictions = Dense(101, activation='softmax', name='pred_age')(attention)
What I expect is to give an image feature from the last FC layer. Add soft attention to LSTM to train attention weights and would like to obtain a class from the output and visualize the soft attention to know where the system is looking at with doing the prediction (similar soft attention visualization as in the paper).
As I am new to the attention mechanism, I did much research and could not find a solution/understanding of my problem. I would like to know if I am doing it right.

Which layer of VGG19 should I use to extract feature

Now, I want feature of image to compute their similarity. We can get feature using pre-trained VGG19 model in tensorflow easily. But VGG19 model has many layers, and I don't know which layer should I use to get feature. Which layer's output is appropriate for this problem?
# I think this how is correct to extract feature
model = tf.keras.application.VGG19(include_top=True,
weight='imagenet')
input = model.input
output = model.layers[-2].output
extract_model = tf.keras.Model(input, output)
It's my infer that the more closer to last output, the more the model output powerful feature. But some tutorials says 'use include_top=False to extract feature' (e.g Image Captioning with Attention TensorFlow)
So, I don't know which layer should I use. Please try to help me here in this thread.
The include_top=False may be used because the last 3 layers (for that specific model) are fully connected layers which are not typically good feature vectors. If the model directly outputs a feature vector, then you don't need it.
Most people use the last layer for transfer learning, but it may depend on your application. For example, Gatys et. al. show that the first few layers of VGG are sensitive to the style of the image and later layers are sensitive to the content.
I would probably try all of them in a hyperparameter search and see which gives the best performance. If by image similarity you mean the similarity of objects contained inside, I would probably start with the last layer.

How to train a neural network model with bert embeddings instead of static embeddings like glove/fasttext?

I am looking for some heads up to train a conventional neural network model with bert embeddings that are generated dynamically (BERT contextualized embeddings which generates different embeddings for the same word which when comes under different context).
In normal neural network model, we would initialize the model with glove or fasttext embeddings like,
import torch.nn as nn
embed = nn.Embedding(vocab_size, vector_size)
embed.weight.data.copy_(some_variable_containing_vectors)
Instead of copying static vectors like this and use it for training, I want to pass every input to a BERT model and generate embedding for the words on the fly, and feed them to the model for training.
So should I work on changing the forward function in the model for incorporating those embeddings?
Any help would be appreciated!
If you are using Pytorch. You can use https://github.com/huggingface/pytorch-pretrained-BERT which is the most popular BERT implementation for Pytorch (it is also a pip package!). Here I'm just going to outline how to use it properly.
For this particular problem there are 2 approaches - where you obviously cannot use the Embedding layer:
You can incorporate generating BERT embeddings into your data preprocessing pipeline. You will need to use BERT's own tokenizer and word-to-ids dictionary. The repo's README has examples on preprocessing.
You can write a loop for generating BERT tokens for strings like this (assuming - because BERT consumes a lot of GPU memory):
(Note: to be more proper you should also add attention masks - which are LongTensor of 1 & 0 masking the sentence lengths)
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel
batch_size = 32
X_train, y_train = samples_from_file('train.csv') # Put your own data loading function here
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
X_train = [tokenizer.tokenize('[CLS] ' + sent + ' [SEP]') for sent in X_train] # Appending [CLS] and [SEP] tokens - this probably can be done in a cleaner way
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model = bert_model.cuda()
X_train_tokens = [tokenizer.convert_tokens_to_ids(sent) for sent in X_train]
results = torch.zeros((len(X_test_tokens), bert_model.config.hidden_size)).long()
with torch.no_grad():
for stidx in range(0, len(X_test_tokens), batch_size):
X = X_test_tokens[stidx:stidx + batch_size]
X = torch.LongTensor(X).cuda()
_, pooled_output = bert_model(X)
results[stidx:stidx + batch_size,:] = pooled_output.cpu()
After which you obtain the results tensor which contains the calculated embeddings, where you can use it as an input to your model.
The full (and more proper) code for this is provided here
This method has the advantage of not having to re-calculate these embeddings every epoch.
With this method, e.g for classification your model should only consist of a Linear(bert_model.config.hidden_size, num_labels) layer, inputs to the model should be the results tensor in the above code
Second, and arguably cleaner method: If you check out the repo, you can find there is wrappers for various tasks (e.g BertForSequenceClassification). It should also be easy to implement your custom classes that inherits from BertPretrainedModel and utilizes the various Bert classes from the repo.
For example, you can use:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', labels=num_labels) # Where num_labels is the number of labels you need to classify.
After which you can continue with the preprocessing, up until generating token ids. Then you can train the entire model (but with a low learning rate e.g Adam 3e-5 for batch_size = 32)
With this you can fine-tune BERT's embeddings itself, or use techniques like freezing BERT for a few epochs to train the classifier only, then unfreeze to fine-tune etc. But it is also more computationally expensive.
An example for this is also provided in the repo

Extracting features from caffe last layer

I am trying to extract features from the last layer of a GoogleNet caffe model fine-tuned on car classification. Here's the deploy.prototxt. I tried a couple of things:
I took features from 'loss3_classifier_model' layer which is incorrect.
Now I am extracting features from 'pool5' layer in the model as given in prototxt.
I am not sure whether it's correct or not because the features I am extracting for different cars doesn't seem to have much difference. In other words, I am unable to differentiate cars using this last layer features, I used Euclidean distance on features (Is it correct?). I am not using using softmax as I don't want to classify them, I just want features and I rechecking them using euclidean distance.
These are the steps I followed:
## load the model
net = caffe.Net('deploy.prototxt',
caffe.TEST,
weights ='googlenet_finetune_web_car_iter_10000.caffemodel')
# resize the input size as I have only one image in my batch.
net.blobs["data"].reshape(1, 3, 224, 224)
# I read my image of size (x,y,3)
frame = cv2.imread(frame_path)
bbox = frame[int(x1):int(x2), int(y1):int(y2)] # getting the car, # I have stored x1,x2,x3,x4 seperatly.
# resized my image to 224,224,3, network input size.
bbox = cv2.resize(bbox, (224, 224))
# to align my input to the input of the model
bbox_input = bbox.swapaxes(1,2).reshape(3,224,224)
# fed input image to the model.
net.blobs['data'].data[0] = bbox_input
net.forward()
# features from pool5 layer or the last layer.
temp = net.blobs["pool5"].data[0]
Now, I want to confirm if these steps are correct or not? I am new to caffe and I am not sure about the steps I wrote above.
Both options are valid. The farther you are from the end of your network, the less specialized the features will be to your problem/training set, while still capturing relevant information that may be applied to similar tasks. As you move to the end of the network, the features will be more tuned to your task.
Note that you are dealing with two similar problems/tasks. The network was fine-tuned for car classification ("which model is this car?") and now you want to verify if two cars belong to the same model.
Considering the network was fine-tuned with a large and representative training set, the features obtained from it are powerful and with a lot of representation capability (i.e., they capture a lot of complex underlying patterns of the task they were trained to) that are useful to your verification task.
With this in mind, you could try many ways of comparing two feature vectors:
Euclidean Distance is too simple. I would try it only because it is easy/fast to implement;
Cosine Similarity [1] might also be a simple, but good starting point;
Classifier. Another possibility that I've have done in a similar problem was to train a classifier (SVM, Logistic Regression) on top of a combination of the two features. The input of your classifier could be the concatenation of them side by side.
Incorporate the verification task to your network. You could alter the GoogleNet architecture to receive two photos of cars and output if they belong or not to the same model. You would be transforming/fine-tuning your network from the classification problem to a verification task. Check for siamese networks [2].
Edit: there is a mistake when you resize your frame that may be the cause of your problems!
# I read my image of size (x,y,3)
frame = cv2.imread(frame_path)
# resized my image to 224,224,3, network input size.
bbox = cv2.resize(frame, (224, 224))
You should have passed frame as input in the cv2.resize() method. You are probably feeding a garbage input to the network and that is why the output ends up always looking similar.

Categories