Finding the most similar documents (nearest neighbours) from a set of documents - python

I have 80,000 documents covering a very wide range of topics. For every article, I want to provide links recommending other articles (something like the top 5 related articles) that are similar to the one the user is currently reading. If I don't have to, I'm not really interested in classifying the documents, just similarity or relatedness, and ideally I would like to output an 80,000 x 80,000 matrix of all the documents with the corresponding distance (or perhaps correlation? similarity?) to other documents in the set.
I'm currently using NLTK to process the contents of the document and get ngrams, but from there I'm not sure what approach I should take to calculate the similarity between documents.
I read about using tf-idf and cosine similarity, however because of the vast number of topics I'm expecting a very high number of unique tokens, so multiplying two very long vectors might be a bad way to go about it. Also 80,000 documents might call for a lot of multiplication between vectors. (Admittedly, it would only have to be done once though, so it's still an option).
Is there a better way to get the distance between documents without creating a huge vector of ngrams? Spearman correlation? Or would a more low-tech approach, like taking the top ngrams and finding other documents with the same ngrams in their top k ngrams, be more appropriate? I just feel like surely I must be going about the problem in the most brute force way possible if I need to multiply possibly 10,000 element vectors together about 3.2 billion times (sum of the arithmetic series 79,999 + 79,998 + ... + 1 = 79,999 x 80,000 / 2).
Any advice for approaches or what to read up on would be greatly appreciated.

So for K=5 you basically want to return the K nearest neighbours of a particular document? In that case you should use the K-Nearest Neighbors algorithm. Scikit-learn has some good text importing and normalizing routines (tf-idf), and then it's pretty easy to implement KNN.
The heuristic is basically just creating normalized word count vectors from all of the words in a document and then comparing the distance between the vectors. I would definitely try out a few different distance metrics: Euclidean vs. Manhattan vs. cosine, for instance. The vectors aren't really long, they just sit in a high-dimensional space. So you can fix the unique words issue you wrote of by just doing some dimensionality reduction through PCA or your favorite algo.
It's probably equally easy to do this in another package, but the documentation of scikit-learn is top notch and makes it easy to learn quickly and thoroughly.
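A minimal sketch of the TF-IDF plus nearest-neighbours idea, assuming scikit-learn is installed. The toy corpus, the choice of cosine distance and the value of k are illustrative assumptions, not something taken from the question. Only the top-K neighbours per document are kept, so the full 80,000 x 80,000 matrix never has to be stored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus standing in for the 80,000 articles (illustrative data only).
docs = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "stock markets fell sharply today",
    "investors sold shares as markets fell",
    "the dog chased the cat",
]

# Sparse TF-IDF matrix: one row per document, one column per term.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Brute-force cosine search works directly on the sparse matrix.
# n_neighbors is k + 1 because each document is its own nearest neighbour.
k = 2
nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine", algorithm="brute")
nn.fit(X)
distances, indices = nn.kneighbors(X)

# Drop the first column (the document itself) to keep the top-k related docs.
related = indices[:, 1:]
for doc_id, neighbours in enumerate(related):
    print(doc_id, "->", neighbours.tolist())
```

Swapping `metric="cosine"` for `"euclidean"` or `"manhattan"` is a one-word change, which makes it easy to compare the metrics mentioned above.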

You should learn about hashing mechanisms that can be used to calculate similarity between documents.
Typical hash functions are designed to minimize collisions, so they map near duplicates to very different hash keys. In cryptographic hash functions, if a single bit of the data is changed, the hash key changes to a completely different one.
The goal of similarity hashing is the opposite. Hash-based techniques for near-duplicate detection are designed so that very similar documents map to very similar hash keys, or even to the same key. The bitwise Hamming distance between two keys is then a measure of similarity.
After calculating the hash keys, the keys can be sorted to increase the speed of near-duplicate detection from O(n^2) to O(n log n). A threshold can be defined and tuned by analysing accuracy on training data.
Simhash, Minhash and locality-sensitive hashing are three hash-based methods. You can google them for more information. There are a lot of research papers related to this topic...
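To make the idea concrete, here is a rough sketch of a Simhash-style fingerprint using only the standard library. The tokenisation (lowercased whitespace split), the use of MD5 as the per-token hash and the 64-bit width are assumptions made for illustration; real implementations usually weight tokens and use shingles.

```python
import hashlib


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str, bits: int = 64) -> int:
    """Fingerprint in which each bit is a majority vote over the token hashes."""
    votes = [0] * bits
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
c = simhash("stock markets fell sharply in early trading today")
print(hamming_distance(a, b), hamming_distance(a, c))
```

Documents that share most of their tokens tend to produce fingerprints that differ in only a few bits, while unrelated documents differ in roughly half of them. The exact distances depend on the hash function and the tokens.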

Related

What does Similarity Score mean in gensim?

I have used the Gensim library to find the similarity between a sentence and a collection of paragraphs, a dataset of texts. I have used cosine similarity, soft cosine similarity and Word Mover's Distance measures separately. Gensim returns a list of items including docid and similarity score. For cosine similarity and soft cosine similarity, I guess the similarity score is the cosine between the vectors. Am I right?
In the Gensim documentation, they write that it is the semantic relatedness, with no further explanation. I have searched a lot, but did not find any answer. Any help, please?
Usually by 'similarity', people are seeking a measure of semantic relatedness - but whether the particular values calculated achieve that will depend on lots of other factors, such as the sufficiency of training data & choice of other appropriate parameters.
Within each code context, 'similarity' has no more and no less meaning than how it's calculated right there - usually, that's 'cosine similarity between vector representations'. (When there are no other hints that it means something different, 'cosine similarity' is typically a safe starting assumption.)
But really: the meaning of 'similarity' at each use is no more and no less than whatever that one code path's docs/source-code dictate.
(I realize that may seem an indirect & unsatisfying answer. If there are specific uses in context in Gensim source/docs/example where the meaning is unclear, you could point those out & I might be able to clarify those more.)
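For reference, the generic definition of cosine similarity between two vectors is short enough to write out. This is a plain-Python sketch of the formula, not Gensim's implementation; the function name is made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal
```

The value ranges from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction). For vectors with non-negative entries, such as TF-IDF vectors, it stays between 0 and 1.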

Computing text similarity against many documents

I'm trying to compute the text similarity of a search term, A, like "How to make chickens" against a collection of other search terms. To compute similarity I'm using the cosine distance and TF-IDF to transform A into a vector. I'd like to compare the similarity of A against all documents at once.
Currently, my approach involves computing the cosine similarity for A against every other document one at a time, iteratively. I have 100 documents I'm comparing against. If the result of cos_sim(A, X) > 0.8 then I break and say "cool, this is similar".
However, I feel like this might not be a true representation of the overall similarity. Is there a way to pre-compute a vector(s) for my 100 documents at runtime, and every time I see a new search query A, I can compare against this pre-defined vector/document?
I believe I can achieve this by simply combining all documents into one... feels rough though. What are the pros and cons, and possible solutions? Extra points for efficiency!
This problem is essentially the traditional search problem: Have you tried putting your documents into something like Lucene (Java) or Whoosh (python)? I think they have a cosine-similarity model (but even if they don't, the default may be better).
The general trick all search engines use is that in general, documents are sparse. This means to compute the similarity (e.g., cosine similarity) it only matters what the lengths of the documents are (known way ahead of time) and the terms that they both contain; you can organize a data structure like a back-of-the-book index, called an inverted index that can quickly tell you which documents will get at least a non-zero score.
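A toy inverted index, to illustrate the back-of-the-book idea described above. The documents and the whitespace tokenisation are made up for the example; a real search engine also stores term frequencies and document lengths alongside the postings.

```python
from collections import defaultdict

# Illustrative documents keyed by id.
docs = {
    0: "how to raise chickens",
    1: "how to bake bread",
    2: "chicken soup recipe",
}

# term -> set of ids of the documents that contain the term
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)


def candidates(query: str) -> set:
    """Ids of documents sharing at least one term with the query."""
    result = set()
    for term in query.lower().split():
        result |= inverted_index.get(term, set())
    return result


print(candidates("how to make chickens"))
```

Only the candidate documents can have a non-zero cosine score, so the similarity computation can skip everything else.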
With only 100 documents, a search engine is probably overkill; you want to pre-compute the TF-IDF vectors and keep them in a numpy matrix. You can then use numpy operations to compute the dot product all at once for all the documents -- it will output a 1x100 vector of the numerators you need. The denominators can similarly be precomputed. A numpy.max(numpy.dot(query, docs)/denom) will then probably be fast enough.
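Here is one way the precomputation could look, using NumPy only. The three documents, the whitespace tokenisation and the smoothed IDF formula are assumptions for the sake of a runnable example; in practice a library vectoriser would build the matrix.

```python
import numpy as np

docs = [
    "how to raise chickens",
    "how to bake bread",
    "how to make chicken soup",
]

# Vocabulary and term-count matrix (rows: documents, columns: terms).
vocab = sorted({term for doc in docs for term in doc.split()})
index = {term: j for j, term in enumerate(vocab)}


def count_vector(text):
    vec = np.zeros(len(vocab))
    for term in text.split():
        if term in index:  # terms unseen in the corpus are ignored
            vec[index[term]] += 1
    return vec


counts = np.array([count_vector(doc) for doc in docs])

# Smoothed IDF (one common variant, chosen here as an assumption).
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + len(docs)) / (1 + doc_freq)) + 1

# Precomputed once: the TF-IDF matrix and its row norms (the denominators).
doc_matrix = counts * idf
doc_norms = np.linalg.norm(doc_matrix, axis=1)


def similarities(query):
    """Cosine similarity of the query against every document at once."""
    q = count_vector(query) * idf
    q_norm = np.linalg.norm(q)
    if q_norm == 0:
        return np.zeros(len(docs))
    return doc_matrix @ q / (doc_norms * q_norm)


scores = similarities("how to make chickens")
best = int(np.argmax(scores))
print(best, scores.round(3))
```

Each new query then costs one matrix-vector product, and `scores.max() > 0.8` replaces the loop with the early break.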
You should profile your code, but I would bet your vector extraction is the slow part; but you should only have to do that once for all queries.
If you had thousands or millions of documents to compare against, you could look into scikit-learn's K-nearest-neighbor structures (e.g., BallTree or KDTree), or things like Facebook's FAISS library.

using LDA for dimension reduction / clustering

I'm currently facing a text mining problem where my goal is to identify clusters within a corpus of short texts.
The idea is, that these clusters represent some kind of technical/domain specific content which all members of the respective cluster have in common.
The final evaluation of the clustering has to be made from a domain-knowledge-based perspective.
I worked myself through a bunch of different approaches.
Topic modeling with LDA seems a good one to start with.
So each of my documents is represented as a mixture of different topics (which are based on the coherent occurrence of single words or n-grams).
My first idea was to use the resulting topics as clusters/groups to group my documents.
But a single document can consist of different topics, so I'm not sure whether this is a good idea.
Furthermore, as LDA does not use a distance measure to calculate its topics, I'm lacking some kind of metric to evaluate my LDA-based clusters. Because I have no ground truth, I'm restricted to evaluation methods that do not require one. I used the silhouette score to evaluate my clusters, but while this metric is based on distances, LDA is not. I'm not sure whether it actually makes sense.
My second thought was to use the LDA results as a preprocessing step for dimension reduction.
To these new input vectors I could apply distance-based clustering methods like agglomerative clustering, k-means or DBSCAN.
I also found some posts and papers which pointed to self-organizing maps to solve that kind of problem. Is this approach worth pursuing, compared to the methods described above?
Is it a reasonable approach to use LDA topics as clusters or as a preprocessing step?
What are metrics to evaluate non-distance approaches like LDA?
Are there any other approaches which I should take into account?
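For what it's worth, the second idea (LDA as a dimension-reduction step followed by a distance-based clusterer) can be sketched in a few lines, assuming scikit-learn. The corpus, the number of topics and the number of clusters below are arbitrary placeholders, and the sketch says nothing about whether the resulting clusters are meaningful for a given domain.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative short texts only.
docs = [
    "pump pressure valve leak",
    "valve leak pressure drop",
    "pump motor vibration noise",
    "software login password error",
    "password reset login failure",
    "software update install error",
]

# LDA works on raw term counts rather than TF-IDF weights.
counts = CountVectorizer().fit_transform(docs)

# Each document becomes a topic mixture: shape (n_docs, n_topics).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Cluster the low-dimensional topic mixtures with a distance-based method.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_topics)
print(labels)
```

Because `doc_topics` is an ordinary numeric matrix, distance-based scores such as the silhouette can at least be computed on it, although that measures the compactness of the topic mixtures rather than the quality of the topic model itself.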

Arranging documents in a grid in accordance with the content similarity

How is it possible to arrange documents in a space (say, multiple grids) so that the position in which they are placed carries information about how similar they are to other documents? I looked into K-means clustering, but it is a bit computationally intensive if the data is large. I'm looking for something like hashing the contents of the document, so that they fit in a large space, similar documents have similar hashes, and the distance between them is small. In that case, it would be easy to find documents similar to a given document without doing much extra work.
The result could be something similar to the picture below. In this case music documents are near film documents but far from documents related to computers. The box can be considered as the whole world of documents.
Any help would be greatly appreciated.
Thanks
jvc007
One way to introduce a distance or similarity measure between documents is:
first encode your documents as vectors, e.g. using TF-IDF (see https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
the scalar product between the vectors of two documents gives you a measure of their similarity (normalise the vectors to unit length first, so that it equals the cosine similarity). The larger this value is, the higher the similarity.
Using MDS (http://en.wikipedia.org/wiki/Multidimensional_scaling) on these similarities should help to visualize the documents in a two-dimensional plot.
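A sketch of those steps, assuming scikit-learn for the TF-IDF part and NumPy for the embedding. Note that this uses classical (Torgerson) MDS written out by hand, which is one specific MDS variant chosen here to keep the example short; the documents are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "guitar music concert",
    "guitar music concert",  # exact duplicate of the first document
    "film director cinema",
    "film music soundtrack",
    "computer keyboard software",
]

# TfidfVectorizer L2-normalises rows by default, so the dot product of two
# rows is their cosine similarity.
X = TfidfVectorizer().fit_transform(docs)
similarity = (X @ X.T).toarray()

# For unit vectors, squared Euclidean distance equals 2 - 2 * similarity.
squared_dist = np.clip(2.0 - 2.0 * similarity, 0.0, None)

# Classical MDS: double-centre the squared distances, then keep the
# eigenvectors belonging to the two largest eigenvalues.
n = squared_dist.shape[0]
centering = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * centering @ squared_dist @ centering
eigenvalues, eigenvectors = np.linalg.eigh(B)
top = np.argsort(eigenvalues)[::-1][:2]
coords = eigenvectors[:, top] * np.sqrt(np.clip(eigenvalues[top], 0.0, None))

for doc, (x, y) in zip(docs, coords):
    print(f"{x:+.3f} {y:+.3f}  {doc}")
```

The rows of `coords` can be passed straight to a scatter plot. The dense n x n similarity matrix limits this approach to a few thousand documents.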
The problem of mapping high-dimensional data to low dimensional space while preserving similarity can be solved using Self-organizing map (SOM or Kohonen network). I have already seen some applications on documents.
I don't know about any python implementation (there might be one), but there is a good one for Matlab (SOM toolbox).
I think what you're looking for is locality-sensitive hashing. See this answer for a nice, graphical explanation and sample code.

Feature Selection and Reduction for Text Classification

I am currently working on a project, a simple sentiment analyzer, with 2 classes in one case and 3 classes in another. I am using a corpus that is pretty rich in unique words (around 200,000). I used the bag-of-words method for feature selection, and to reduce the number of unique features I eliminate words whose frequency of occurrence falls below a threshold. The final set includes around 20,000 features, which is a 90% decrease, but not enough for the intended test-prediction accuracy. I am using LibSVM and SVM-light in turn for training and prediction (both linear and RBF kernel), and also Python and Bash in general.
The highest accuracy observed so far is around 75% and I need at least 90%. This is the case for binary classification. For multi-class training, the accuracy falls to ~60%. I need at least 90% in both cases and cannot figure out how to increase it: by optimizing training parameters or by optimizing feature selection?
I have read articles about feature selection in text classification and what I found is that three different methods are used, which have actually a clear correlation among each other. These methods are as follows:
Frequency approach of bag-of-words (BOW)
Information Gain (IG)
X^2 Statistic (CHI)
The first method is already the one I use, but I use it very simply and need guidance for a better use of it in order to obtain high enough accuracy. I am also lacking knowledge about practical implementations of IG and CHI and looking for any help to guide me in that way.
Thanks a lot, and if you need any additional info for help, just let me know.
#larsmans: Frequency Threshold: I am looking for the occurrences of unique words in examples, such that if a word is occurring in different examples frequently enough, it is included in the feature set as a unique feature.
#TheManWithNoName: First of all, thanks for your effort in explaining the general concerns of document classification. I examined and experimented with all the methods you brought forward, and others. I found the Proportional Difference (PD) method the best for feature selection, where features are uni-grams, with Term Presence (TP) for the weighting (I didn't understand why you tagged Term-Frequency-Inverse-Document-Frequency (TF-IDF) as an indexing method; I rather consider it a feature weighting approach). Pre-processing is also an important aspect for this task, as you mentioned. I used certain types of string elimination for refining the data, as well as morphological parsing and stemming. Also note that I am working on Turkish, which has different characteristics compared to English. Finally, I managed to reach ~88% accuracy (f-measure) for binary classification and ~84% for multi-class. These values are solid proof of the success of the model I used. This is what I have done so far. I am now working on clustering and reduction models; I have tried LDA and LSI and am moving on to moVMF and maybe spherical models (LDA + moVMF), which seem to work better on corpora that have an objective nature, like news corpora. If you have any information and guidance on these issues, I would appreciate it. I especially need info on setting up an interface (Python oriented, open-source) between feature space dimension reduction methods (LDA, LSI, moVMF etc.) and clustering methods (k-means, hierarchical etc.).
This is probably a bit late to the table, but...
As Bee points out and you are already aware, the use of SVM as a classifier is wasted if you have already lost the information in the stages prior to classification. However, the process of text classification requires much more than just a couple of stages, and each stage has significant effects on the result. Therefore, before looking into more complicated feature selection measures, there are a number of much simpler possibilities that will typically require much lower resource consumption.
Do you pre-process the documents before performing tokenisation/representation into the bag-of-words format? Simply removing stop words or punctuation may improve accuracy considerably.
Have you considered altering your bag-of-words representation to use, for example, word pairs or n-grams instead? You may find that you have more dimensions to begin with but that they condense down a lot further and contain more useful information.
It's also worth noting that dimension reduction is feature selection/feature extraction. The difference is that feature selection reduces the dimensions in a univariate manner, i.e. it removes terms on an individual basis as they currently appear without altering them, whereas feature extraction (which I think Ben Allison is referring to) is multivariate, combining one or more single terms together to produce higher orthogonal terms that (hopefully) contain more information and reduce the feature space.
Regarding your use of document frequency, are you merely using the probability/percentage of documents that contain a term, or are you using the term densities found within the documents? If category one has only 10 documents and they each contain a term once, then category one is indeed associated with the term. However, if category two has only 10 documents that each contain the same term a hundred times, then obviously category two has a much stronger relation to that term than category one. If term densities are not taken into account this information is lost, and the fewer categories you have, the more impact this loss will have. On a similar note, it is not always prudent to retain only terms that have high frequencies, as they may not actually be providing any useful information. For example, if a term appears a hundred times in every document, then it is considered a noise term and, while it looks important, there is no practical value in keeping it in your feature set.
Also how do you index the data, are you using the Vector Space Model with simple boolean indexing or a more complicated measure such as TF-IDF? Considering the low number of categories in your scenario a more complex measure will be beneficial as they can account for term importance for each category in relation to its importance throughout the entire dataset.
Personally I would experiment with some of the above possibilities first and then consider tweaking the feature selection/extraction with a (or a combination of) complex equations if you need an additional performance boost.
Additional
Based on the new information, it sounds as though you are on the right track and 84%+ accuracy (F1 or BEP - precision and recall based for multi-class problems) is generally considered very good for most datasets. It might be that you have successfully acquired all information rich features from the data already, or that a few are still being pruned.
Having said that, something that can be used as a predictor of how good aggressive dimension reduction may be for a particular dataset is 'Outlier Count' analysis, which uses the decline of Information Gain in outlying features to determine how likely it is that information will be lost during feature selection. You can use it on the raw and/or processed data to give an estimate of how aggressively you should aim to prune features (or unprune them as the case may be). A paper describing it can be found here:
Paper with Outlier Count information
With regards to describing TF-IDF as an indexing method, you are correct in it being a feature weighting measure, but I consider it to be used mostly as part of the indexing process (though it can also be used for dimension reduction). The reasoning for this is that some measures are better aimed toward feature selection/extraction, while others are preferable for feature weighting specifically in your document vectors (i.e. the indexed data). This is generally due to dimension reduction measures being determined on a per category basis, whereas index weighting measures tend to be more document orientated to give superior vector representation.
In respect to LDA, LSI and moVMF, I'm afraid I have too little experience of them to provide any guidance. Unfortunately I've also not worked with Turkish datasets or the python language.
I would recommend dimensionality reduction instead of feature selection. Consider either singular value decomposition, principal component analysis, or even better considering it's tailored for bag-of-words representations, Latent Dirichlet Allocation. This will allow you to notionally retain representations that include all words, but to collapse them to fewer dimensions by exploiting similarity (or even synonymy-type) relations between them.
All these methods have fairly standard implementations that you can get access to and run---if you let us know which language you're using, I or someone else will be able to point you in the right direction.
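Since the question mentions Python, here is a minimal sketch of the SVD option using scikit-learn's `TruncatedSVD`, which works directly on sparse TF-IDF matrices (this combination is often called latent semantic analysis). The corpus and the choice of 2 components are placeholders; real corpora typically use a few hundred components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative reviews only.
docs = [
    "great film with a wonderful cast",
    "wonderful acting and a great story",
    "terrible film with awful acting",
    "awful story and a terrible cast",
    "the cast was fine but the story dragged",
    "a fine film overall",
]

X = TfidfVectorizer().fit_transform(docs)       # sparse, (n_docs, n_terms)

svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)                # dense, (n_docs, 2)

print(X.shape, "->", X_reduced.shape)
```

The reduced matrix can be fed to any classifier in place of the original bag-of-words features.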
There's a python library for feature selection
TextFeatureSelection. This library provides discriminatory power in the form of score for each word token, bigram, trigram etc.
For those familiar with feature selection methods in machine learning: it is based on the filter method and gives ML engineers the tools they need to improve classification accuracy in their NLP and deep learning models. It has 4 methods, namely Chi-square, Mutual information, Proportional difference and Information gain, to help select words as features before they are fed into machine learning classifiers.
from TextFeatureSelection import TextFeatureSelection
#Multiclass classification problem
input_doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. i wish i stayed back at home.','i just had lunch','Do you want chips?']
target=['Positive','Positive','Negative','Neutral','Neutral']
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
#Binary classification
input_doc_list=['i am content with this location','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
Edit:
It now has genetic algorithm for feature selection as well.
from TextFeatureSelection import TextFeatureSelectionGA
#Input documents: doc_list
#Input labels: label_list
getGAobj=TextFeatureSelectionGA(percentage_of_token=60)
best_vocabulary=getGAobj.getGeneticFeatures(doc_list=doc_list,label_list=label_list)
Edit2
There is another method now, TextFeatureSelectionEnsemble, which combines feature selection with ensembling. It does feature selection for base models through document frequency thresholds. At the ensemble layer, it uses a genetic algorithm to identify the best combination of base models and keeps only those.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from TextFeatureSelection import TextFeatureSelectionEnsemble
imdb_data=pd.read_csv('../input/IMDB Dataset.csv')
le = LabelEncoder()
imdb_data['labels'] = le.fit_transform(imdb_data['sentiment'].values)
#convert raw text and labels to python list
doc_list=imdb_data['review'].tolist()
label_list=imdb_data['labels'].tolist()
#Initialize parameter for TextFeatureSelectionEnsemble and start training
gaObj=TextFeatureSelectionEnsemble(doc_list,label_list,n_crossvalidation=2,pickle_path='/home/user/folder/',average='micro',base_model_list=['LogisticRegression','RandomForestClassifier','ExtraTreesClassifier','KNeighborsClassifier'])
best_columns=gaObj.doTFSE()
Check the project for details: https://pypi.org/project/TextFeatureSelection/
Linear svm is recommended for high dimensional features. Based on my experience the ultimate limitation of SVM accuracy depends on the positive and negative "features". You can do a grid search (or in the case of linear svm you can just search for the best cost value) to find the optimal parameters for maximum accuracy, but in the end you are limited by the separability of your feature-sets. The fact that you are not getting 90% means that you still have some work to do finding better features to describe your members of the classes.
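The search over the cost value could look roughly like this, assuming scikit-learn rather than LibSVM or SVM-light. The tiny dataset, the grid of C values and the 2-fold cross-validation are placeholders for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Illustrative training data: 1 = positive, 0 = negative.
texts = [
    "great film", "wonderful acting", "great story", "wonderful cast",
    "terrible film", "awful acting", "terrible story", "awful cast",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectoriser inside the pipeline means it is refitted on each
# training fold, so no information leaks from the held-out fold.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])

param_grid = {"svm__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

If the best cross-validated score plateaus well below the target across the whole grid, that is a hint that the features, rather than the cost parameter, are the limiting factor.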
I'm sure this is way too late to be of use to the poster, but perhaps it will be useful to someone else. The chi-squared approach to feature reduction is pretty simple to implement. Assuming BoW binary classification into classes C1 and C2: for each feature f in candidate_features, calculate the frequency of f in C1 and the total number of words in C1; repeat the calculations for C2; calculate a chi-square statistic from these counts; then filter candidate_features based on whether the p-value is below a certain threshold (e.g. p < 0.05). A tutorial using Python and nltk can be seen here: http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/ (though if I remember correctly, I believe the author incorrectly applies this technique to his test data, which biases the reported results).
