How to find text similarity within millions of entries? - python

Having used spaCy to find similarity across a few texts, I'm now trying to find similar texts among millions of entries (instantaneously).
I have an app with millions of texts and I'd like to present the user with similar texts if they ask to.
How do sites like StackOverflow find similar questions so fast?
I can imagine 2 approaches:
Each time a text is inserted, it is compared against the entire DB and a link is created between the matching texts (in an intermediate table with both foreign keys).
Each time a text is inserted, its vector is stored in a field associated with that text. Whenever a user asks for similar texts, it searches the DB for similar texts.
My doubt is with the second option: is storing the word vector enough to search quickly for similar texts?

Comparing all the texts every time a new request comes in is infeasible.
To be really fast on large datasets I can recommend locality-sensitive hashing (LSH). It gives you entries that are similar with high probability and significantly reduces the complexity of your algorithm.
However, you have to build the index once - that may take time - but after that lookups are very fast.
https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134
https://en.wikipedia.org/wiki/Locality-sensitive_hashing
Here is a tutorial that seems close to your application:
https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/
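For illustration, here is a minimal MinHash-LSH sketch using the datasketch library (the library choice, threshold and num_perm values are illustrative assumptions, not tuned):

from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    # hash the set of tokens in a text into a MinHash signature
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf8"))
    return m

texts = {"doc1": "how to find similar texts quickly",
         "doc2": "finding similar text entries fast",
         "doc3": "an unrelated recipe for cooking pasta"}

# index every stored text once; lookups afterwards are cheap
lsh = MinHashLSH(threshold=0.3, num_perm=128)
for key, text in texts.items():
    lsh.insert(key, minhash(text))

# returns keys of texts that are similar with high probability
print(lsh.query(minhash("find similar texts fast")))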

You want a function that can map quickly from a text into a multi-dimensional space. Your collection of documents should be indexed with respect to that space so that you can quickly find the shortest-distance match between a new text and those already in the space.
Algorithms exist that speed up that indexing process - but it could be as simple as sub-dividing the space into shards or blocks on a less granular basis and narrowing down the search that way.
One simple way of defining such a space is on term frequency (TF) or term frequency-inverse document frequency (TFIDF). Without a limit on your vocabulary size these can suffer from space/accuracy issues, but with a vocabulary of, say, the 100 most specific words in a corpus you should be able to get a reasonable indication of similarity that scales to millions of results. It depends on your corpus.
There are plenty of alternative features you might consider - but all of them resolve to having a reliable method of transforming your document into a geometric vector, which you can then interrogate for similarity.
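As a rough sketch of that kind of mapping with scikit-learn (the vocabulary cap of 100 and the toy corpus are illustrative assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = ["machine learning for text similarity",
          "fast text search over millions of entries",
          "cooking pasta at home"]

# cap the vocabulary, mirroring the "100 most specific words" suggestion
vectorizer = TfidfVectorizer(max_features=100, stop_words="english")
X = vectorizer.fit_transform(corpus)

# index the vectors once, then query for the shortest-distance matches
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
distances, neighbours = index.kneighbors(vectorizer.transform(["searching similar text entries"]))
print(neighbours)  # row indices of the closest stored documents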

Related

How do I calculate the similarity of a word or couple of words compared to a document using a doc2vec model?

In gensim I have a trained doc2vec model, if I have a document and either a single word or two-three words, what would be the best way to calculate the similarity of the words to the document?
Do I just do the standard cosine similarity between them as if they were 2 documents? Or is there a better approach for comparing small strings to documents?
On first thought I could get the cosine similarity from each word in the 1-3 word string against every word in the document and take the average, but I don't know how effective this would be.
There's a number of possible approaches, and what's best will likely depend on the kind/quality of your training data and ultimate goals.
With any Doc2Vec model, you can infer a vector for a new text that contains known words – even a single-word text – via the infer_vector() method. However, like Doc2Vec in general, this tends to work better with documents of at least dozens, and preferably hundreds, of words. (Tiny 1-3 word documents seem especially likely to get somewhat peculiar/extreme inferred-vectors, especially if the model/training-data was underpowered to begin with.)
Beware that unknown words are ignored by infer_vector(), so if you feed it a 3-word document for which two words are unknown, it's really just inferring based on the one known word. And if you feed it only unknown words, it will return a random, mild initialization vector that's undergone no inference tuning. (All inference/training always starts with such a random vector, and if there are no known words, you just get that back.)
Still, this may be worth trying, and you can directly compare via cosine-similarity the inferred vectors from tiny and giant documents alike.
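A minimal sketch of that comparison with gensim (the model path is hypothetical and the texts are illustrative):

import numpy as np
from gensim.models import Doc2Vec

model = Doc2Vec.load("my_doc2vec.model")  # hypothetical path to a trained model

short_vec = model.infer_vector(["continuous", "integration"])
long_vec = model.infer_vector("a much longer document about build servers and automated testing".split())

# cosine similarity between the two inferred vectors
cosine = np.dot(short_vec, long_vec) / (np.linalg.norm(short_vec) * np.linalg.norm(long_vec))
print(cosine)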
Many Doc2Vec modes train both doc-vectors and compatible word-vectors. The default PV-DM mode (dm=1) does this, or PV-DBOW (dm=0) if you add the optional interleaved word-vector training (dbow_words=1). (If you use dm=0, dbow_words=0, you'll get fast training, and often quite-good doc-vectors, but the word-vectors won't have been trained at all - so you wouldn't want to look up such a model's word-vectors directly for any purposes.)
With such a Doc2Vec model that includes valid word-vectors, you could also analyze your short 1-3 word docs via their individual words' vectors. You might check each word individually against a full document's vector, or use the average of the short document's words against a full document's vector.
Again, which is best will likely depend on other particulars of your need. For example, if the short doc is a query, and you're listing multiple results, it may be the case that query result variety – via showing some hits that are really close to single words in the query, even when not close to the full query – is as valuable to users as documents close to the full query.
Another measure worth looking at is "Word Mover's Distance", which works just with the word-vectors for a text's words, as if they were "piles of meaning" for longer texts. It's a bit like the word-against-every-word approach you entertained – but working to match words with their nearest analogues in a comparison text. It can be quite expensive to calculate (especially on longer texts) – but can sometimes give impressive results in correlating alternate texts that use varied words to similar effect.
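A minimal Word Mover's Distance sketch with gensim word-vectors (the vector file path is hypothetical; wmdistance() needs the POT package installed):

from gensim.models import KeyedVectors

wv = KeyedVectors.load("my_word_vectors.kv")  # hypothetical path to trained word-vectors
distance = wv.wmdistance("ci server went down".split(),
                         "the jenkins build machine crashed".split())
print(distance)  # lower distance means the texts use words with closer meanings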

Feature space reduction for tag prediction

I am writing a ML module (python) to predict tags for a stackoverflow question (title + body). My corpus is of around 5 million questions with title, body and tags for each. I'm splitting this 3:2 for training and testing. I'm plagued by the curse of dimensionality.
Work Done
Pre-processing: markup removal, stopword removal, special character removal and a few bits and pieces. Store into MySQL. This almost halves the size of the test data.
ngram association: for each unigram and bigram in the title and the body of each question, I maintain a list of the associated tags. Store into redis. This results in about a million unique unigrams and 20 million unique bigrams, each with a corresponding list of tag frequencies. Ex.
"continuous integration": {"ci":42, "jenkins":15, "windows":1, "django":1, ....}
Note: There are 2 problems here: a) not all unigrams and bigrams are important, and b) not all tags associated with an ngram are important, although this doesn't mean that tags with frequency 1 are all equivalent or can be removed haphazardly. The number of tags associated with a given ngram easily runs into the thousands - most of them unrelated and irrelevant.
tfidf: to aid in selecting which ngrams to keep, I calculated the tfidf score for the entire corpus for each unigram and bigram and stored the corresponding idf values with associated tags. Ex.
"continuous integration": {"ci":42, "jenkins":15, ...., "__idf__":7.2123}
The tfidf scores are stored in a document × feature sparse.csr_matrix (generated by fit_transform()), and I'm not sure how I can leverage that at the moment.
Questions
How can I use this processed data to reduce the size of my feature set? I've read about SVD and PCA but the examples always talk about a set of documents and a vocabulary. I'm not sure where the tags from my set can come in. Also, the way my data is stored (redis + sparse matrix), it is difficult to use an already implemented module (sklearn, nltk etc) for this task.
Once the feature set is reduced, the way I have planned to use it is as follows:
Preprocess the test data.
Find the unigrams and bigrams.
For the ones stored in redis, find the corresponding best-k tags
Apply some kind of weight for the title and body text
Apart from this I might also search for exact known tag matches in the document. E.g., if "ruby-on-rails" occurs in the title/body then there's a high probability that it's also a relevant tag.
Also, for tags predicted with a high probability, I might leverage a tag graph (an undirected graph with weighted edges between tags that frequently occur together) to predict more tags.
Are there any suggestions on how to improve upon this? Can a classifier come in handy?
Footnote
I've a 16-core, 16GB RAM machine. The redis-server (which I'll move to a different machine) is stored in RAM and is ~10GB. All the tasks mentioned above (apart from tfidf) are done in parallel using ipython clusters.
Use the public API of Dandelion; this is a demo.
It extracts concepts from a text, so, in order to reduce dimensionality, you could use those concepts instead of the bag-of-words paradigm.
A baseline statistical approach would treat this as a classification problem. Features are bags-of-words processed by a maximum-entropy classifier like Mallet (http://mallet.cs.umass.edu/classification.php). Maxent (aka logistic regression) is good at handling large feature spaces. Take the probability associated with each tag (i.e., the class labels) and choose some decision threshold that gives you a precision/recall trade-off that works for your project. Some of the Mallet documentation even mentions topic classification, which is very similar to what you are trying to do.
The open questions are how well Mallet handles the size of your data (which isn't that big) and whether this particular tool is a non-starter with the technology stack you mentioned. You might be able to train offline (dump the redis database to a text file in Mallet's feature format) and run the Mallet-learned model in Python. Evaluating a maxent model is simple. If you want to stay in Python and have this be more automated, there are Python-based maxent implementations in NLTK and probably in scikit-learn. This approach is not at all state-of-the-art, but it'll work okay and be a decent baseline against which to compare more complicated methods.
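If you do stay in Python, a rough baseline along those lines with scikit-learn could look like this (the toy data and the decision threshold are illustrative assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

questions = ["how to set up jenkins for continuous integration",
             "rails migration fails on postgres"]
tags = [["ci", "jenkins"], ["ruby-on-rails", "postgresql"]]

# one binary maxent (logistic regression) model per tag
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(questions)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# keep every tag whose probability clears a tunable decision threshold
probs = clf.predict_proba(vectorizer.transform(["jenkins build keeps failing"]))[0]
threshold = 0.3  # tune on held-out data for the precision/recall trade-off
print([tag for tag, p in zip(mlb.classes_, probs) if p >= threshold])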

Finding the most similar documents (nearest neighbours) from a set of documents

I have 80,000 documents about a very large number of topics. What I want to do is, for every article, provide links recommending other articles (something like the top 5 related articles) that are similar to the one a user is currently reading. If I don't have to, I'm not really interested in classifying the documents, just in similarity or relatedness, and ideally I would like to output an 80,000 x 80,000 matrix of all the documents with the corresponding distance (or perhaps correlation? similarity?) to the other documents in the set.
I'm currently using NLTK to process the contents of the document and get ngrams, but from there I'm not sure what approach I should take to calculate the similarity between documents.
I read about using tf-idf and cosine similarity, however because of the vast number of topics I'm expecting a very high number of unique tokens, so multiplying two very long vectors might be a bad way to go about it. Also 80,000 documents might call for a lot of multiplication between vectors. (Admittedly, it would only have to be done once though, so it's still an option).
Is there a better way to get the distance between documents without creating a huge vector of ngrams? Spearman Correlation? or would a more low-tech approach like taking the top ngrams and finding other documents with the same ngrams in the top k-ngrams be more appropriate? I just feel like surely I must be going about the problem in the most brute force way possible if I need to multiply possibly 10,000 element vectors together 320 million times (sum of the arithmetic series 79,999 + 79,998... to 1).
Any advice for approaches or what to read up on would be greatly appreciated.
So for K=5 you basically want to return the K nearest neighbours of a particular document? In that case you should use the k-nearest-neighbours algorithm. Scikit-learn has some good text importing and normalizing routines (tfidf), and then it's pretty easy to implement KNN.
The heuristics are basically just creating normalized word-count vectors from all of the words in a document and then comparing the distance between the vectors. I would definitely try out a few different distance metrics: Euclidean vs. Manhattan vs. cosine similarity, for instance. The vectors aren't really long, they just sit in a high-dimensional space. So you can fix the unique-words issue you wrote about by doing some dimensionality reduction through PCA or your favourite algorithm.
It's probably equally easy to do this in another package, but the documentation of scikit-learn is top notch and makes it easy to learn quickly and thoroughly.
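A rough sketch of that pipeline (TF-IDF, dimensionality reduction, then KNN) in scikit-learn; the toy documents and n_components are illustrative:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline

docs = ["article about new films",
        "another film review",
        "a piece on database servers",
        "guitar music article"]

# TruncatedSVD plays the PCA role for sparse text matrices
pipeline = make_pipeline(TfidfVectorizer(stop_words="english"),
                         TruncatedSVD(n_components=2))
X = pipeline.fit_transform(docs)

knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
_, neighbours = knn.kneighbors(pipeline.transform(["a new film article"]))
print(neighbours)  # indices of the most similar stored documents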
You should learn about hashing mechanisms that can be used to calculate similarity between documents.
Typical hash functions are designed to minimize collisions and to map near-duplicates to very different hash keys. With cryptographic hash functions, changing the data by a single bit changes the hash key completely.
The goal of similarity hashing is the opposite: to create a hash function where very similar documents map to very similar hash keys, or even to the same key. Hash-based techniques for near-duplicate detection are therefore designed with the opposite intent of cryptographic hash algorithms. The bitwise Hamming distance between keys is then a measure of similarity.
After calculating the hash keys, they can be sorted to speed up near-duplicate detection from O(n^2) to O(n log n). A threshold can be defined and tuned by analysing accuracy on training data.
SimHash, MinHash and locality-sensitive hashing are three implementations of hash-based methods. You can google them for more information; there are a lot of research papers on this topic.
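For example, a bare-bones SimHash-style sketch in Python (64 bits and unweighted tokens are illustrative choices):

import hashlib
import numpy as np

def simhash(text, bits=64):
    # sum +1/-1 per bit position over the token hashes, then keep the sign bits
    v = np.zeros(bits)
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode("utf8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

k1 = simhash("the jenkins build server crashed last night")
k2 = simhash("the jenkins build machine crashed last night")
print(hamming(k1, k2))  # small distance -> similar documents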

Arranging documents in a grid in accordance with the content similarity

How is it possible to arrange documents into a space (say, multiple grids) so that the position in which they are placed contains information about how similar they are to other documents? I looked into k-means clustering, but it is a bit computationally intensive if the data is large. I'm looking for something like hashing the contents of the documents, so that they fit into a large space, similar documents have similar hashes, and the distance between them is small. In that case it would be easy to find documents similar to a given document without doing much extra work.
The result could be something similar to the picture below. In this case music documents are near film documents but far from documents related to computers. The box can be considered as the whole world of documents.
Any help would be greatly appreciated.
One way to introduce a distance or similarity measure between documents is:
first, encode your documents as vectors, e.g. using TF-IDF (see https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
the scalar product between the vectors of two documents gives you a measure of the similarity of those documents; the larger this value, the higher the similarity
using MDS (http://en.wikipedia.org/wiki/Multidimensional_scaling) on these similarities should help to visualize the documents in a two-dimensional plot.
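A small sketch of that recipe with scikit-learn (the toy documents are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances

docs = ["review of a new film",
        "music album review",
        "computer hardware guide",
        "another film article"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
D = cosine_distances(X)  # pairwise dissimilarities between documents

# project onto two dimensions while preserving the distances as well as possible
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(D)
print(coords)  # nearby points correspond to similar documents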
The problem of mapping high-dimensional data to a low-dimensional space while preserving similarity can be solved with a self-organizing map (SOM, or Kohonen network). I have already seen some applications to documents.
I don't know of any Python implementation (there might be one), but there is a good one for Matlab (the SOM Toolbox).
I think what you're looking for is locality-sensitive hashing. See this answer for a nice, graphical explanation and sample code.

group detection in large data sets python

I am a newbie in Python and have been trying my hand at different problems that introduce me to different modules and functionalities (I find it a good way of learning).
I have googled around a lot but haven't found anything close to a solution to the problem.
I have a large data set of Facebook posts from various groups on Facebook that use it as a medium to mass-distribute information.
I want to make groups out of these posts that are content-wise the same.
For example, one of the posts is "xyz.com is selling free domains. Go register at xyz.com"
and another is "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost."
These are similar, as they both ask the reader to go to the group's website and register.
P.S.: just a clarification - if either of the links had been abc.com, the posts wouldn't have been similar.
Priority goes to the source and then to the action (the action here being registering).
Is there a simple way to do it in python? (a module maybe?)
I know it requires some sort of clustering algorithm (correct me if I am wrong); my question is, can Python make this job easier for me somehow? Some module or anything?
Any help is much appreciated!
Assuming you have a function called geturls that takes a string and returns a list of the URLs contained within it, I would do it like this:
from collections import defaultdict

# group posts by the URLs they contain; geturls() is the helper assumed above
groups = defaultdict(list)
for post in facebook_posts:
    for url in geturls(post):
        groups[url].append(post)
That greatly depends on your definition of being "content-wise the same". A straightforward approach is to use a so-called Term Frequency - Inverse Document Frequency (TFIDF) model.
Simply put, make a long list of all the words in all your posts, filter out stop words (articles, determiners etc.), and for each document (= post) count how often each term occurs, multiplying that by the importance of the term (its inverse document frequency, calculated as the log of the total number of documents divided by the number of documents in which the term occurs). This way, words which are very rare will be more important than common words.
You end up with a huge table in which every document (still, we're talking about group posts here) is represented by a (very sparse) vector of terms. Now you have a metric for comparing documents. As your documents are very short, only a few terms will score significantly, so similar documents might be the ones where the same term achieves the highest score (i.e. the highest component of the document vectors is the same), or maybe where the Euclidean distance between the three highest-scoring components is below some threshold. That sounds very complicated, but (of course) there's a module for that.
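As a from-scratch sketch of that recipe (the stop-word list and example posts are illustrative):

import math
from collections import Counter

posts = ["xyz.com is selling free domains go register at xyz.com",
         "everyone needs to register again at xyz.com",
         "abc.com has a new photo contest"]
stopwords = {"is", "at", "to", "a", "has", "go", "again"}

tokenized = [[w for w in p.lower().split() if w not in stopwords] for p in posts]
df = Counter(w for doc in tokenized for w in set(doc))  # document frequency per term

def tfidf(doc):
    tf = Counter(doc)
    # term frequency times log(total docs / docs containing the term)
    return {w: tf[w] * math.log(len(posts) / df[w]) for w in tf}

vectors = [tfidf(doc) for doc in tokenized]
print(vectors[0])  # rare terms such as "domains" score higher than common ones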
