Document Clustering and Visualization


I would like to test whether a set of documents shares some special similarity by looking at a graph built from each document's vector representation, shown together with a text dataset of other documents. I expect that they will appear close together in a visualization.
Is the solution to use doc2vec to calculate a vector for each document and plot it? Can it be done in an unsupervised way? Which Python library should I use to get those beautiful 2D and 3D representations of Word2vec?

I'm not sure what you're asking, but if you want a way to check whether vectors are of the same type, you could use K-Means.
K-Means groups a list of vectors into K clusters, so if you choose a good K (not too low, so that it actually finds some structure, but not too high, so that it is not too discriminating) it could work.
K-Means roughly works this way:
init_center(K)  # randomly set K vectors that will be the centers of your clusters
while not converge():  # tricky, as there are many ways to check for convergence; the easiest is to check whether the centers have moved since the last iteration
    associate_vector()  # associate each vector with its closest center
    re_calculate_center()  # move each center to the mean of all the vectors in its cluster
This gif is probably clearer than I am:
And this article (where the gif comes from) is much clearer than I am, even though it discusses a Java implementation:
https://picoledelimao.github.io/blog/2016/03/12/multithreaded-k-means-in-java/
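The K-Means loop described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not a replacement for a library implementation: the function name `kmeans`, the fixed seed, and the toy 2-D points (standing in for document vectors) are all assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means on an (n, d) array; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # init_center(K): pick K distinct input vectors as the starting centers
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # associate_vector(): assign every vector to its closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # re_calculate_center(): move each center to the mean of its cluster
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        # converge(): stop once no center has moved
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated toy groups stand in for document vectors.
data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centers, labels = kmeans(data, k=2)
```

With real document vectors you would pass the (n_documents, n_dimensions) array in place of `data`; random initialization means the result can depend on the seed.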

Related

What is the most effective way to compare similarity of two images (which contain buildings)

I have an image comparison problem.
To be more precise, I have a test image (a building taken from outside; it could be a house, an apartment, or a big public building) and I need to compare it against 100,000 other building images in my DB.
Is there an effective method to output top X images (which are most similar, if not the same) in the most accurate way possible to-date?
A number of StackOverflow answers guided me towards feature matching with OpenCV, but sadly I failed to make progress (the accuracy was poor and I could not find a way to improve it).
For instance, this is a test image that I would like to compare (white house - South). test_image
and these are the images in my DB pic1_DB pic2_DB pic3_DB pic4_DB pic5_DB
The desired/ideal output would be "the test image is the same building as that in Pic1, Pic3, Pic4 and Pic5".
And the test image is significantly different from Pic2.
Thank you all.

matchTemplate won't work well in this case, as it needs an exact size and viewpoint match.
OpenCV feature-based methods might work. You can try a SIFT-based method first. But the general assumption is that the rotation, translation, and perspective changes are bounded. This means that, for a pair of images to be matched, one cannot be taken from 20 m away and the other from 10 km away. These assumptions are made so that the features can be associated.
A deep-learning-based method might work well given a large enough dataset. Take PoseNet for reference: it can match the same building from different viewpoints and associate them correctly.
Each method has pros and cons. You have to decide which method you can afford to use.
Regards
Dr. Yuan Shenghai

For pixel-wise similarity, you may use res = cv2.matchTemplate(img1, img2, cv2.TM_CCOEFF_NORMED); similarity = res[0][0], which uses the standard correlation coefficient to evaluate similarity (first make sure the two input images are the same size).
For chromatic similarity, you may calculate the histogram of each image with cv2.calcHist, then measure the similarity between the histograms with a metric of your choice.
For intuitive similarity, I'm afraid you have to use some machine learning or deep learning method since "similar" is a rather vague concept here.
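The histogram comparison suggested above can also be sketched without OpenCV. The following is a minimal NumPy illustration, assuming 8-bit images with the colour channels on the last axis; the function name `histogram_correlation` and the random test images are made up for the example. In OpenCV, cv2.calcHist and cv2.compareHist would play the same roles.

```python
import numpy as np

def histogram_correlation(img_a, img_b, bins=32):
    """Pearson correlation between per-channel intensity histograms (8-bit images assumed)."""
    hists = []
    for img in (img_a, img_b):
        # One histogram per colour channel, concatenated into a single vector
        channels = img.reshape(-1, img.shape[-1])
        h = [np.histogram(channels[:, c], bins=bins, range=(0, 256))[0]
             for c in range(channels.shape[1])]
        h = np.concatenate(h).astype(float)
        hists.append(h / h.sum())  # normalise so image size does not matter
    return float(np.corrcoef(hists[0], hists[1])[0, 1])

# Synthetic stand-ins for real photos: one dark image and a brightened copy.
rng = np.random.default_rng(0)
dark = rng.integers(0, 128, size=(40, 60, 3))
bright = dark + 128

same_score = histogram_correlation(dark, dark)    # identical colour content
diff_score = histogram_correlation(dark, bright)  # disjoint intensity ranges
```

A score near 1 means the colour distributions are alike; it says nothing about where the colours sit in the image, so two different buildings with similar colours can still score highly.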

Comparing feature extractors (or comparing aligned images)

I'd like to compare ORB, SIFT, BRISK, AKAZE, etc. to find which works best for my specific image set. I'm interested in the final alignment of images.
Is there a standard way to do it?
I'm considering this solution: take each algorithm, extract the features, compute the homography and transform the image.
Now I need to check which transformed image is closer to the target template.
Maybe I can repeat the process with the target template and the transformed image and look for the homography matrix closest to the identity, but I'm not sure how to compute this closeness exactly. And I'm not sure which algorithm I should use for this check; I suppose a fixed one.
Or I could do some pixel-level comparison between the images using a perceptual difference hash (dHash). But I suspect the resulting Hamming distance may not be very good for images that will be nearly identical.
I could blur them and do a simple subtraction, but that sounds quite weak.
Thanks for any suggestions.
EDIT: I have thousands of images to test. These are real-world pictures. The images are of documents of different kinds, some with a lot of graphics, others mostly geometrical. I have about 30 different templates. I suspect different templates work best with different algorithms (I know the template in advance, so I could pick the best one).
Right now I use cv2.matchTemplate to find some reference patches in the transformed images and I compare their locations to the reference ones. It works but I'd like to improve over this.

From your question, it seems like the task is not to compare the feature extractors themselves, but rather to find which type of feature extractor leads to the best alignment.
For this, you need two things:
a way to perform the alignment using the features from different extractors
a way to check the accuracy of the alignment
The algorithm you suggested is a good approach for doing the alignment. To check its accuracy, you need to know what a good alignment is.
You may start with an alignment you already know. The easiest way to know the alignment between two images is to have produced the transformation yourself. For example, starting with one image, you rotate it by some amount, translate, crop, or scale it, or combine all these operations. Knowing how you obtained the image, you can obtain your ideal alignment (the one that undoes your operations).
Then, having the ideal alignment and the alignment generated by your algorithm, you can use a metric to evaluate its accuracy, depending on your definition of "good alignment".
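One possible metric, sketched here under assumptions, is the mean distance between the image corners as mapped by the known transform and by the estimated one. The function names `warp_points` and `corner_error`, the 640x480 image size, and the "estimated" matrix (simply the true one shifted by 2 px) are all hypothetical; in practice the estimate would come from the homography computed with each extractor's matches.

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography to an (n, 2) array of points."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def corner_error(H_true, H_est, width, height):
    """Mean distance in pixels between the image corners mapped by both homographies."""
    corners = np.array([[0, 0], [width, 0], [width, height], [0, height]], dtype=float)
    diff = warp_points(H_true, corners) - warp_points(H_est, corners)
    return float(np.linalg.norm(diff, axis=1).mean())

# Known transform: a 10 degree rotation plus a shift, i.e. the operation applied on purpose.
theta = np.deg2rad(10)
H_true = np.array([[np.cos(theta), -np.sin(theta), 15.0],
                   [np.sin(theta),  np.cos(theta), -8.0],
                   [0.0, 0.0, 1.0]])
# Hypothetical estimate from one extractor: the same transform, off by 2 px in x.
H_est = H_true.copy()
H_est[0, 2] += 2.0

error_px = corner_error(H_true, H_est, width=640, height=480)
```

Because the error is expressed in pixels, it can be averaged over many synthetic transforms and compared directly between ORB, SIFT, BRISK, AKAZE, and so on.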

DBSCAN plotting Non-geometrical-Data

I used the sklearn clustering algorithm DBSCAN to get clusters of my data.
Data: Non-geometrical objects based on hexadecimal strings
I used a simple distance to create a distance matrix as input for DBSCAN, resulting in the expected clusters.
Question: Is it possible to create a plot of these cluster results like the one in the demo?
I didn't find a solution by searching.
I need to graphically demonstrate the similarities of the objects and clusters to each other.
Since I am using Python for everything (in that project), I would appreciate a solution in Python.

I don't use python, so I cannot give you example code.
If your data isn't 2 dimensional, you can try to find a good 2-dimensional approximation using Multidimensional Scaling.
Essentially, it takes an input distance matrix (which should satisfy the triangle inequality, and ideally be derived from Euclidean distance in some vector space; but you can often get good results even if this does not strictly hold). It then tries to find the best 2-dimensional data set that has the same distances.
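Since the question asks for Python, here is a minimal sketch of classical (Torgerson) multidimensional scaling in NumPy, which is one variant of the technique described above. The hex strings, the Hamming distance, and the cluster labels are made up for the example; in practice you would pass the same distance matrix that was given to DBSCAN and colour the points by its labels.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Embed points in n_components dimensions from an (n, n) distance matrix."""
    n = dist.shape[0]
    # Double-centre the squared distances to get an inner-product (Gram) matrix
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Keep the directions with the largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean input) are clipped to zero
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

def hamming(a, b):
    """Number of differing characters between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

# Made-up hex strings and made-up cluster labels, standing in for the DBSCAN output.
items = ["a1f0", "a1f1", "a1e0", "0b3c", "0b3d", "1b3c"]
labels = [0, 0, 0, 1, 1, 1]
dist = np.array([[hamming(a, b) for b in items] for a in items], dtype=float)
coords = classical_mds(dist)
```

The resulting `coords` array has one 2-D point per object, so it can be handed to any scatter-plot routine with the cluster labels as colours.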

Supervised Machine Learning: Classify types of clusters of data based on shape and density (Python)

I have multiple sets of data, and in each set of data there is a region that is somewhat banana shaped and two regions that are dense blobs. I have been able to differentiate these regions from the rest of the data using a DBSCAN algorithm, but I'd like to use a supervised algorithm to have the program then know which cluster is the banana, and which two clusters are the dense blobs, and I'm not sure where to start.
As there are 3 categories (banana, blob, neither), would doing two separate logistic regressions be the best approach (evaluate if it is banana or not-banana and if it is blob or not-blob)? Or is there a good way to incorporate all 3 categories into one neural network?
Here are three data sets. In each, the banana is red. In the 1st, the two blobs are green and blue, in the 2nd the blobs are cyan and green, and in the 3rd the blobs are blue and green. Now that the program has differentiated the regions, I'd like it to label the banana and blob regions so I don't have to hand-pick them every time I run the code.

As you are using Python, one of the best options would be to start with a big library that offers many different approaches, so you can choose the one that suits you best. One such library is sklearn http://scikit-learn.org/stable/ .
Getting back to the problem itself. What are the models you should try?
Support Vector Machines - this model has been around for a while and has become a gold standard in many fields, mostly due to its elegant mathematical interpretation and ease of use (it has far fewer parameters to worry about than classical neural networks, for instance). It is a binary classification model, but the library will automatically create a multi-class version for you.
Decision tree - very easy to understand, yet creates quite "rough" decision boundaries.
Random forest - a model often used in the more statistical community.
K-nearest neighbours - the simplest approach, but if you can define the shapes of your data so easily, it will provide very good results while remaining very easy to understand.
Of course there are many others, but I would recommend starting with these. All of them support multi-class classification, so you do not need to worry about how to encode the problem with three classes; simply create the data in the form of two arrays x and y, where x holds the input values and y is a vector of the corresponding classes (e.g. numbers from 1 to 3).
Visualization of different classifiers from the library:
That leaves the question of how to represent the shape of a cluster. We need a fixed-length real-valued vector, so what can the features actually represent?
center of mass (if position matters)
skewness/kurtosis
covariance matrix (or its eigenvalues) (if rotation matters)
some kind of local density estimation
histograms of some statistics (like a histogram of pairwise Euclidean distances between pairs of points on the shape)
many, many more!
There is quite comprehensive list and detailed overview here (for three-dimensional objects):
http://web.ist.utl.pt/alfredo.ferreira/publications/DecorAR-Surveyon3DShapedescriptors.pdf
There is also quite informative presentation:
http://www.global-edge.titech.ac.jp/faculty/hamid/courses/shapeAnalysis/files/3.A.ShapeRepresentation.pdf
Describing some descriptors and how to make them scale/position/rotation invariant (if it is relevant here)
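A few of the features listed above can be computed in a handful of lines. The sketch below is only an illustration: the function name `cluster_features`, the choice of features (centroid, covariance eigenvalues, and their ratio), and the synthetic 2-D blob and arc are assumptions made for the example, not a recommended final descriptor.

```python
import numpy as np

def cluster_features(points):
    """Fixed-length descriptor for an (n, d) cluster: centroid, covariance eigenvalues, elongation."""
    centroid = points.mean(axis=0)
    # Eigenvalues of the covariance matrix are rotation invariant, largest first
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(points, rowvar=False)))[::-1]
    elongation = eigvals[0] / eigvals[-1]  # close to 1 for a round blob
    return np.concatenate([centroid, eigvals, [elongation]])

rng = np.random.default_rng(0)
# Synthetic stand-ins: a round blob and a thin arc ("banana") in 2-D
blob = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(300, 2))
angles = rng.uniform(0.0, np.pi, size=300)
banana = np.column_stack([5 * np.cos(angles), 5 * np.sin(angles)])
banana += rng.normal(scale=0.2, size=banana.shape)

blob_vec = cluster_features(blob)
banana_vec = cluster_features(banana)
```

Stacking one such vector per cluster gives the x matrix for any of the classifiers mentioned earlier, with the hand-assigned class of each cluster as y.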

Neural networks could help here; the "pybrain" library might be the best choice for that.
You could set up the neural net as a feed-forward network, with one output for each class of object you expect the data to contain.
Edit: sorry if I have completely misinterpreted the question. I'm assuming you have preexisting data you can use to train the networks to differentiate clusters.
If there are 3 categories you could have 3 outputs to the NN, or perhaps a single NN for each one that simply outputs a true or false value.

I believe you are still unclear about what you want to achieve.
That of course makes it hard to give you a good answer.
Your data seems to be 3D. In 3D you could for example compute the alpha shape of a cluster, and check if it is convex. Because your "banana" probably is not convex, while your blobs are.
You could also measure, e.g., whether the cluster center actually is inside your cluster. If it isn't, the cluster is not a blob. You can also measure whether the extents along the three axes are the same or not.
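The "is the center inside the cluster" check can be approximated as follows. This is a rough sketch under assumptions: "inside" is taken to mean that the centroid lies about as close to some cluster point as the points lie to each other, the function name `centroid_gap_ratio` is made up, and the 3-D blob and arc are synthetic.

```python
import numpy as np

def centroid_gap_ratio(points):
    """Distance from the centroid to its nearest point, divided by the median nearest-neighbour spacing."""
    centroid = points.mean(axis=0)
    gap = np.linalg.norm(points - centroid, axis=1).min()
    # Pairwise distances; the diagonal is masked so a point is not its own neighbour
    pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(pairwise, np.inf)
    spacing = np.median(pairwise.min(axis=1))
    return float(gap / spacing)

rng = np.random.default_rng(0)
# Synthetic 3-D stand-ins for a dense blob and a banana-shaped arc
blob = rng.normal(size=(300, 3))
angles = rng.uniform(0.0, np.pi, size=300)
banana = np.column_stack([5 * np.cos(angles), 5 * np.sin(angles), np.zeros(300)])
banana += rng.normal(scale=0.2, size=banana.shape)

blob_ratio = centroid_gap_ratio(blob)
banana_ratio = centroid_gap_ratio(banana)
```

A ratio around 1 suggests the centroid sits among the points, while a much larger ratio suggests it sits in empty space, as it does for the arc. Where to put the threshold is a judgment call that depends on your data.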
But in the end, you need some notion of "banana".

Classifying a Distribution of Points for Object Identification

I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (I have previously classified signatures with these, which is a similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch, I just want a pointer to know which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.

I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do a principal component analysis and match by the first eigenvector.
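Matching by the first eigenvector could look roughly like this. It is a minimal sketch, assuming 2-D point sets whose main difference is orientation; the helper names `principal_axis`, `best_match`, and `elongated_cloud`, as well as the synthetic clouds, are made up for the example.

```python
import numpy as np

def principal_axis(points):
    """Unit vector along the direction of largest variance of an (n, 2) point set."""
    centered = points - points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

def best_match(query, candidates):
    """Index of the candidate whose principal axis is most parallel to the query's."""
    q = principal_axis(query)
    # abs() because an axis and its negation describe the same orientation
    scores = [abs(q @ principal_axis(c)) for c in candidates]
    return int(np.argmax(scores))

def elongated_cloud(angle_deg, rng, n=200):
    """Synthetic test data: a stretched Gaussian cloud rotated by angle_deg."""
    t = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    return (rng.normal(size=(n, 2)) * [3.0, 0.5]) @ rot.T

rng = np.random.default_rng(1)
query = elongated_cloud(30, rng)
candidates = [elongated_cloud(a, rng) for a in (90, 35, 150)]
match = best_match(query, candidates)  # index of the 35 degree cloud
```

This only compares orientation, so it cannot tell apart two distributions with the same main axis but different spread; the eigenvalues would have to be compared as well for that.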

You should just fit the distributions to the data, determine the chi^2 deviation for each one, and look at the F-test. See for instance these notes on model fitting.

You might also want to consider non-parametric techniques (e.g. multivariate kernel density estimation on each of your new data sets) in order to compare the statistics or distances of the estimated distributions. In Python, scipy.stats.gaussian_kde is an implementation available in SciPy.
