Preparing SIFT descriptors for further SVM classification (OpenCV 3, sklearn) - python

I'm trying to classify images using SIFT-computed local descriptors with Bag of Visual Words, KMeans clustering and histograms.
I've read a lot of SO answers and tried to follow their instructions; however, I feel I don't understand how the whole pipeline should work. Below is the code I've implemented, and it runs really slowly.
That's why I'm asking this question: to clarify my understanding of using SIFT descriptors for classification, verify my implementation, and get feedback that improves my knowledge of the concept.
Firstly, I've written a wrapper class for SIFT. It computes RootSIFT descriptors on image patches extracted with a sliding window. Its main function, detectAndCompute, takes an image, crops it into several sub-images using the sliding window, computes RootSIFT descriptors for each sub-image, and stacks all the descriptors from all sub-images into a single descriptor matrix.
class DenseRootSIFT(object):
    def __init__(self):
        self.sift = cv2.xfeatures2d.SIFT_create()

    def detectAndCompute(self, image, step_size=12, window_size=(10, 10)):
        if window_size is None:
            winH, winW = image.shape[:2]
            # (height, width) order, to match _crop_image's unpacking
            window_size = (winH // 4, winW // 4)
        descriptors = np.array([], dtype=np.float32).reshape(0, 128)
        for crop in self._crop_image(image, step_size, window_size):
            crop = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
            descs = self._detectAndCompute(crop)[1]
            if descs is not None:
                descriptors = np.vstack([descriptors, descs])
        return descriptors

    def _detect(self, image):
        return self.sift.detect(image)

    def _compute(self, image, kps, eps=1e-7):
        kps, descs = self.sift.compute(image, kps)
        if len(kps) == 0:
            return [], None
        # RootSIFT: L1-normalize, then take the element-wise square root
        descs /= (descs.sum(axis=1, keepdims=True) + eps)
        descs = np.sqrt(descs)
        return kps, descs

    def _detectAndCompute(self, image):
        kps = self._detect(image)
        return self._compute(image, kps)

    def _sliding_window(self, image, step_size, window_size):
        for y in xrange(0, image.shape[0], step_size):
            for x in xrange(0, image.shape[1], step_size):
                yield (x, y, image[y:y + window_size[1], x:x + window_size[0]])

    def _crop_image(self, image, step_size=12, window_size=(10, 10)):
        crops = []
        winH, winW = window_size
        for (x, y, window) in self._sliding_window(image, step_size=step_size, window_size=(winW, winH)):
            # skip incomplete windows at the image borders
            if window.shape[0] != winH or window.shape[1] != winW:
                continue
            crops.append(image[y:y + winH, x:x + winW])
        return np.array(crops)
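For instance, it can be used like this (a minimal usage sketch; the file name is a placeholder):

dense_sift = DenseRootSIFT()
image = cv2.imread('some_image.jpg')  # placeholder path
descriptors = dense_sift.detectAndCompute(image)  # (N, 128) matrix of stacked RootSIFT descriptors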
Below I post my class DenseRootSiftPreparator, which should provide the tools for extracting SIFT features from an image and preparing them for classification (in particular, with LinearSVC from sklearn).
So, I follow this process:
Generate a codebook (the _generate_codebook function in the class below). The codebook is generated by applying mini-batch KMeans clustering with 2048 clusters; as output, the function returns a 2048 x 128 matrix of cluster centers.
Then I create a histogram for each image in the dataset, following the instructions linked above. A histogram for a single image is built by the _create_histogram function: the histogram is initialized with zeros, descriptors are computed for the input image, and for each descriptor I find the index of the closest entry in the previously generated codebook (using KDTree from scipy) and increment the histogram at that index. Finally, I L2-normalize the histogram array and return it. The same process is repeated for each image, and it is very slow; the per-descriptor loop could also be vectorized, as sketched below.
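For reference, the per-descriptor loop in _create_histogram is equivalent to a single batched KDTree query, which is dramatically faster (a sketch, using the same names as in the code that follows):

# Vectorized quantization: query all descriptors at once,
# then count how many fall into each codebook bin.
_, nearest = tree.query(descriptors)
histogram = np.bincount(nearest, minlength=hist_size).astype(np.float64)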
Here's the code for DenseRootSiftPreparator:
class DenseRootSiftPreparator(object):
    def __init__(self, histogram_size=2048):
        self.X = []
        self.dense_root_sift = DenseRootSIFT()
        self.histogram_size = histogram_size

    def fit(self, image_dataset, y=None):
        # @param image_dataset: array of images in OpenCV format
        self.X = image_dataset

    def extract_descriptors_and_prepare_for_classification(self, image):
        return self._get_histograms_for_image(image)

    def _get_histograms_for_image(self, image):
        codebook = self._generate_codebook(image)
        histograms = []
        for img in self.X:
            histogram = self._create_histogram(img, self.histogram_size, codebook)
            histograms.append(histogram)
        return histograms

    def _create_histogram(self, image, hist_size, codebook):
        histogram = np.zeros(hist_size)
        descriptors = self.dense_root_sift.detectAndCompute(image, window_size=None)
        # find the nearest codeword for every descriptor
        tree = spatial.KDTree(codebook)
        for i in xrange(len(descriptors)):
            histogram[tree.query(descriptors[i])[1]] += 1
        return normalize(histogram[:, np.newaxis], axis=0).ravel()

    def _generate_codebook(self, image):
        descriptors = self.dense_root_sift.detectAndCompute(image, window_size=None)
        kmeans = MiniBatchKMeans(n_clusters=2048, batch_size=128,
                                 n_init=10, max_no_improvement=10)
        kmeans.fit(descriptors)
        return kmeans.cluster_centers_
I would test my code in the following way:
images = get_images_dataset()
test_input_img = cv2.imread('test_input_image.jpg')
histogram_extractor = DenseRootSiftPreparator()
histogram_extractor.fit(images)
hists = histogram_extractor.extract_descriptors_and_prepare_for_classification(test_input_img)
Here are my imports (just in case):
import numpy as np
from scipy import spatial
import cv2
from cv2.xfeatures2d import SIFT_create
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import normalize
My main questions:
Is my understanding of creating the Bag of Visual Words model using SIFT descriptors correct?
If not, what am I doing wrong? What could be done better?
Do the functions described above work as they should, or am I missing something?
Is there a way to make the SIFT descriptor preparation for classification better and more efficient?

Related

PyTorch for Object detection - Image augmentation

I am using PyTorch for object detection and refining an existing model (transfer learning) as described in the following link -
https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html
While different transformations are used for image augmentation (horizontal flip in this tutorial), the tutorial doesn't mention anything about transforming the bounding box/annotation to ensure it stays in line with the transformed image. Am I missing something basic here?
In the training phase, the transforms are indeed applied to both images and targets while the data is loaded. In the PennFudanDataset class, we have these two lines:
if self.transforms is not None:
    img, target = self.transforms(img, target)
where target is a dictionary containing:
target = {}
target["boxes"] = boxes
target["labels"] = labels
target["masks"] = masks
target["image_id"] = image_id
target["area"] = area
target["iscrowd"] = iscrowd
self.transforms in the PennFudanDataset class is set to the return value of get_transform(): a T.Compose over [T.ToTensor()], with T.RandomHorizontalFlip(0.5) appended when train=True. The dataset is instantiated with:
dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
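For reference, get_transform() in the tutorial looks essentially like this (paraphrased from the tutorial code):

import transforms as T  # the tutorial's custom transforms module

def get_transform(train):
    transforms = [T.ToTensor()]
    if train:
        # the flip transforms image and target together
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)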
These transforms come from T, a custom transforms module written for the object detection task. Specifically, in the __call__ of RandomHorizontalFlip(), both the image and the target (boxes, masks, keypoints) are processed. For the sake of completeness, here is the code borrowed from the GitHub repo:
def __call__(self, image, target):
    if random.random() < self.prob:
        height, width = image.shape[-2:]
        image = image.flip(-1)
        bbox = target["boxes"]
        # mirror the box x-coordinates around the vertical axis
        bbox[:, [0, 2]] = width - bbox[:, [2, 0]]
        target["boxes"] = bbox
        if "masks" in target:
            target["masks"] = target["masks"].flip(-1)
        if "keypoints" in target:
            keypoints = target["keypoints"]
            keypoints = _flip_coco_person_keypoints(keypoints, width)
            target["keypoints"] = keypoints
    return image, target
Here you can see how the boxes, masks, and keypoints are flipped in accordance with the image.

How to calculate the similarity of two images

I'm trying to examine two images for similarity using SIFT. The result should be a percentage.
I understand how to extract the features and descriptors from the images using OpenCV and its libraries. But calculating the distance between the descriptors doesn't give me a percentage, and I haven't yet worked out in my head how to get one.
Can someone help me put the missing piece in my head together properly?
alg = cv2.xfeatures2d.SIFT_create()

trainFiles = getPaths(dirTrain)
images = []
for file in trainFiles:
    img = cv2.imread(file)
    images.append(img)
np_images = np.array(images)

descriptors = np.zeros((1, 128))  # matrix to hold the descriptors
for i, img in enumerate(np_images):
    kp, des = alg.detectAndCompute(img, None)
    descriptors = np.concatenate((descriptors, des), axis=0)
    print('Processed image {} of {}'.format(i, len(np_images)))
descriptors = descriptors[1:, :]  # drop the zero placeholder row

a = descriptors[0]
b = descriptors[1]
# euclidean distance
dist = np.linalg.norm(a - b)
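One common way to get a percentage (a sketch, and only one of several reasonable definitions of similarity) is to match the two images' descriptors with a ratio test and report the fraction of good matches:

# des1, des2: the (N, 128) descriptor matrices of the two images,
# e.g. from alg.detectAndCompute(img, None)
bf = cv2.BFMatcher(cv2.NORM_L2)
matches = bf.knnMatch(des1, des2, k=2)

# Lowe's ratio test filters out ambiguous matches
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

similarity = 100.0 * len(good) / min(len(des1), len(des2))
print('Similarity: {:.1f}%'.format(similarity))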

Getting and comparing BOVW histograms for image similarity

I'm building an image similarity program and, as I am a beginner in CV, I talked with an expert who gave me the following recommended steps to get the really basic functionality:
Extract keypoints (DoG, Harris, etc.) and local invariant descriptors (SIFT, SURF, etc.) from all images.
Cluster them to form a codebook (bag of visual words dictionary; BOVW)
Quantize the features from each image into a BOVW histogram
Compare the BOVW histograms for each image (typically using chi-squared, cosine, or euclidean distance)
Point number one is easy, but I start getting confused at step 2. This is the code I've written so far:
import cv2
import numpy as np

dictionarySize = 20
BOW = cv2.BOWKMeansTrainer(dictionarySize)

for imgpath in ['testimg/testcropped1.jpg', 'testimg/testcropped2.jpg', 'testimg/testcropped3.jpg']:
    img = cv2.imread(imgpath)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    dst = cv2.cornerHarris(gray, 2, 3, 0.04)

    sift = cv2.xfeatures2d.SIFT_create()
    kp = sift.detect(gray, None)
    kp, des = sift.compute(img, kp)

    img = cv2.drawKeypoints(gray, kp, img)
    cv2.imwrite('%s_keypoints.jpg' % imgpath, img)
    BOW.add(des)
I extract features using SIFT and then add each image's descriptors to the BOVW trainer. The problem is I have no idea whether this is correct, or how to get the histograms.
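For what it's worth, here is a sketch of steps 2-4 continuing from the loop above (OpenCV's BOWImgDescriptorExtractor handles the quantization; the distance at the end is just one of the options listed in step 4):

# Step 2: cluster the collected descriptors into a vocabulary
vocabulary = BOW.cluster()  # shape: (dictionarySize, 128)

# Step 3: quantize each image's descriptors into a BOVW histogram
extractor = cv2.BOWImgDescriptorExtractor(sift, cv2.BFMatcher(cv2.NORM_L2))
extractor.setVocabulary(vocabulary)

histograms = []
for imgpath in ['testimg/testcropped1.jpg', 'testimg/testcropped2.jpg', 'testimg/testcropped3.jpg']:
    gray = cv2.cvtColor(cv2.imread(imgpath), cv2.COLOR_BGR2GRAY)
    hist = extractor.compute(gray, sift.detect(gray, None))  # 1 x dictionarySize, normalized
    histograms.append(hist)

# Step 4: compare two histograms, e.g. with chi-squared distance
d = cv2.compareHist(histograms[0][0], histograms[1][0], cv2.HISTCMP_CHISQR)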

K-means color clustering - omit background pixels with masked numpy arrays

I'm trying to find the 3 dominant colours of several images using K-means clustering. The problem I'm facing is that K-means also clusters the background of the image. I am using Python 2.7 and OpenCV 3.
All images have the same grey background with the RGB colour 150,150,150. To avoid K-means clustering the background colour as well, I created a masked array which masks all '150' pixel values from the original image array, theoretically leaving only the non-background pixels for K-means to work with. However, when I run my script, it still returns grey as one of the dominant colours.
My question: is a masked array the way to go (and did I do something wrong) or are there better alternatives to somehow exclude pixels from K-means clustering?
Please find my code below:
from sklearn.cluster import KMeans
from sklearn import metrics
import cv2
import numpy as np

def centroid_histogram(clt):
    # count how many pixels fall into each cluster
    numLabels = np.arange(0, len(np.unique(clt.labels_)) + 1)
    (hist, _) = np.histogram(clt.labels_, bins=numLabels)
    hist = hist.astype("float")
    hist /= hist.sum()
    return hist

image = cv2.imread("test1.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# downscale so the longest side is 100 px
h, w, _ = image.shape
w_new = int(100 * w / max(w, h))
h_new = int(100 * h / max(w, h))
image = cv2.resize(image, (w_new, h_new))

image_array = image.reshape((image.shape[0] * image.shape[1], 3))
image_array = np.ma.masked_values(image_array, 150)

clt = KMeans(n_clusters=3)
clt.fit(image_array)

hist = centroid_histogram(clt)
zipped = zip(hist, clt.cluster_centers_)
zipped.sort(reverse=True, key=lambda x: x[0])
hist, clt.cluster_centers_ = zip(*zipped)

print(clt.cluster_centers_)
If you want to extract the values of pixels other than your background, you can use numpy indexing:
img2=image_array[image_array!=[150,150,150]]
img2=img2.reshape((len(img2)/3,3))
This will yield the list of pixels which are not [150,150,150].
However, it does not preserve the structure of the image; it just gives you the list of pixel values. I can't really remember, but maybe for K-means you need to give it the whole image, i.e. also feed it the positions of the pixels? In that case, no masking will ever help, because masking only replaces the values of certain pixels with others, it doesn't get rid of the pixels altogether.
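Note that image_array != [150, 150, 150] compares individual channel values, so the flat result only reshapes cleanly when every masked pixel has all three channels equal to 150. A more robust sketch filters whole pixels instead:

# Keep only pixels whose three channels are not all equal to 150.
# image_array is the (N, 3) array of RGB pixels from the question.
mask = ~np.all(image_array == [150, 150, 150], axis=1)
img2 = image_array[mask]  # shape: (num_foreground_pixels, 3)

clt = KMeans(n_clusters=3)
clt.fit(img2)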

How to find the PCA of an image using Python and OpenCV?

I am developing an image classifier using SVM. In the feature extraction phase, can I use PCA as the feature? How do I find the PCA of an image using Python and OpenCV? My plan is:
Find the PCA of each image in the training set and store it in an array (it may be a list of lists).
Store the class labels in another list.
Pass both as arguments to the SVM.
Am I going in the right direction? Please help me.
Yes, you can do PCA+SVM. Some might argue that PCA is not the best feature to use, or that SVM is not the best classification algorithm, but hey, a good start is better than sitting around.
To do PCA with OpenCV, try something like this (I haven't verified the code; it's just to give you an idea):
import os
import cv2
import numpy as np

# Construct the input matrix
in_matrix = None
for f in os.listdir('dirpath'):
    # Read the image in as a gray level image. Some modifications
    # of the code are needed if you want to read it in as a color
    # image. For simplicity, let's use gray level images for now.
    im = cv2.imread(os.path.join('dirpath', f), cv2.IMREAD_GRAYSCALE)

    # Assume your images are all the same size, width w and height h.
    # If not, resize them to w * h first with cv2.resize(..)
    vec = im.reshape(w * h)

    # Stack the row vectors up to form the matrix
    try:
        in_matrix = np.vstack((in_matrix, vec))
    except:
        in_matrix = vec

# PCA
if in_matrix is not None:
    mean, eigenvectors = cv2.PCACompute(in_matrix, np.mean(in_matrix, axis=0).reshape(1, -1))
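From there, the SVM step might look like this (a sketch; labels and new_vec are assumed to exist, and LinearSVC is just one reasonable choice of classifier):

from sklearn.svm import LinearSVC

# Project each image vector onto the principal components
# to obtain its PCA feature vector.
features = cv2.PCAProject(in_matrix, mean, eigenvectors)

# labels: one class label per image, in the same order as in_matrix (assumed)
clf = LinearSVC()
clf.fit(features, labels)

# Classify a new image vector (new_vec, assumed to be w * h long) the same way
new_feature = cv2.PCAProject(new_vec.reshape(1, -1), mean, eigenvectors)
prediction = clf.predict(new_feature)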
