Computing text similarity against many documents

I'm trying to compute the text similarity of a search term, A, like "How to make chickens" against a collection of other search terms. To compute similarity I'm using the cosine distance and TF-IDF to transform A into a vector. I'd like to compare the similarity of A against all documents at once.
Currently, my approach involves computing the cosine similarity for A against every other document one at a time, iteratively. I have 100 documents I'm comparing against. If the result of cos_sim(A, X) > 0.8 then I break and say "cool, this is similar".
However, I feel like this might not be a true representation of the overall similarity. Is there a way to pre-compute a vector(s) for my 100 documents at runtime, and every time I see a new search query A, I can compare against this pre-defined vector/document?
I believe I can achieve this by simply combining all documents into one, but that feels rough. What are the pros and cons, and what are possible solutions? Extra points for efficiency!

This problem is essentially the traditional search problem: have you tried putting your documents into something like Lucene (Java) or Whoosh (Python)? I think they have a cosine-similarity model (but even if they don't, the default may be better).
The trick all search engines use is that documents are generally sparse. This means that to compute the similarity (e.g., cosine similarity), all that matters is the lengths of the documents (known well ahead of time) and the terms that they both contain. You can organize a data structure like a back-of-the-book index, called an inverted index, that can quickly tell you which documents will get a non-zero score.
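To make the inverted-index idea concrete, here is a minimal sketch in plain Python. The function names, the whitespace tokenizer, and the toy documents are assumptions for illustration only; a real engine such as Lucene or Whoosh does far more (stemming, scoring, compression).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            index[term].add(doc_id)
    return index

def candidate_docs(index, query):
    """Return ids of documents sharing at least one term with the query."""
    candidates = set()
    for term in query.lower().split():
        candidates |= index.get(term, set())
    return candidates

# Toy data, purely illustrative
docs = [
    "how to cook chicken",
    "raising chickens at home",
    "python numpy tutorial",
]
index = build_inverted_index(docs)
hits = candidate_docs(index, "How to make chickens")
```

Only the documents in `hits` can have a non-zero cosine score against the query, so the similarity itself only needs to be computed for those.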
With only 100 documents, a search engine is probably overkill; you want to pre-compute the TF-IDF vectors and keep them in a numpy matrix. You can then use numpy operations to compute the dot product all at once for all the documents -- it will output a 1x100 vector of the numerators you need. The denominators can similarly be precomputed. A numpy.max(numpy.dot(query, docs)/denom) will then probably be fast enough.
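As a rough sketch of the precompute-once idea, the following builds a small TF-IDF matrix with numpy, normalizes each row up front (so the denominators are already folded in), and scores a query against every document with one matrix-vector product. The helper names, the smoothed IDF formula, and the toy documents are my own assumptions, not part of the question.

```python
import numpy as np

def fit_tfidf(docs):
    """Build a vocabulary, IDF weights, and an L2-normalized TF-IDF matrix."""
    terms = sorted({t for d in docs for t in d.lower().split()})
    vocab = {t: i for i, t in enumerate(terms)}
    counts = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for term in doc.lower().split():
            counts[row, vocab[term]] += 1
    df = (counts > 0).sum(axis=0)
    idf = np.log((1 + len(docs)) / (1 + df)) + 1  # one common smoothed variant (assumption)
    matrix = counts * idf
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)  # assumes no empty documents
    return vocab, idf, matrix

def vectorize_query(query, vocab, idf):
    """Turn a query into a normalized TF-IDF vector; unknown terms are ignored."""
    vec = np.zeros(len(vocab))
    for term in query.lower().split():
        if term in vocab:
            vec[vocab[term]] += 1
    vec *= idf
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Done once, ahead of time
docs = ["how to cook chicken", "how to raise chickens", "numpy dot product tutorial"]
vocab, idf, doc_matrix = fit_tfidf(docs)

# Done per query: one product gives the cosine similarity to every document
query_vec = vectorize_query("how to make chickens", vocab, idf)
scores = doc_matrix @ query_vec
best = int(np.argmax(scores))
```

Because `scores` holds the similarity to every document, you can threshold on the maximum (the 0.8 check from the question) or look at the whole distribution instead of stopping at the first match.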
You should profile your code, but I would bet your vector extraction is the slow part, and you should only have to do that once for all queries.
If you had thousands or millions of documents to compare against, you could look into scikit-learn's K-nearest-neighbor structures (e.g., Ball Tree or KDTree), or things like Facebook's FAISS library.

Related

How to speed up computing sentence similarity using spacy in Python?

I have the following code which takes in 2 sentences and returns the similarity:
    import numpy as np
    import spacy

    nlp = spacy.load("en_core_web_md/en_core_web_md-3.2.0")

    def get_categories_nlp_sim(cat_1, cat_2):
        # NaN is the only value that is not equal to itself
        if (cat_1 != cat_1) or (cat_2 != cat_2):
            s = np.nan
        else:
            doc1 = nlp(cat_1)
            doc2 = nlp(cat_2)
            s = doc1.similarity(doc2)
        return s
So, this seems to give reasonable results but when using it in a for loop of ~1M rows, it just becomes too slow to use.
Any ideas on how to speed this up? or perhaps another NLP library that could do the same thing faster?
Thanks!

If you truly have 1M rows and compare each of them as pairs, you would have an astronomical number of comparisons. spaCy's nlp() also does a whole lot more than just what is needed for the similarity.
What spaCy's similarity() does is take the processed documents' vectors and calculate a cosine similarity (document vector = average over word vectors); check out the source code.
So probably the most efficient way for you to replicate the similarity for this many pairs would be to get a semantic vector representation for each unique token in the entire corpus, using something like Gensim's pretrained word2vec model. Then, for each row, calculate the average of the vectors of its tokens. Once you have those 1M document vectors as numpy arrays, you can calculate the cosine similarities using numpy or scipy, which is drastically faster than pure Python.
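A minimal sketch of that averaging approach, using a tiny hand-made vector table in place of a pretrained model. The `word_vectors` dictionary, its 3-dimensional values, and the helper names are hypothetical; in practice the vectors would come from word2vec or spaCy.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a real setup would load pretrained ones.
word_vectors = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.9, 0.1, 0.0]),
    "car": np.array([0.0, 1.0, 0.0]),
    "truck": np.array([0.0, 0.9, 0.1]),
}
DIM = 3

def doc_vector(text):
    """Average the vectors of known tokens; all zeros if none are known."""
    vecs = [word_vectors[t] for t in text.lower().split() if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

def rowwise_cosine(a, b):
    """Cosine similarity between a[i] and b[i] for every row i; NaN for zero vectors."""
    dots = np.einsum("ij,ij->i", a, b)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return np.divide(dots, norms, out=np.full(len(a), np.nan), where=norms > 0)

texts_1 = ["cat dog", "car", "unknown words"]
texts_2 = ["dog", "truck", "cat"]
m1 = np.vstack([doc_vector(t) for t in texts_1])
m2 = np.vstack([doc_vector(t) for t in texts_2])
sims = rowwise_cosine(m1, m2)  # one similarity per (texts_1[i], texts_2[i]) pair
```

All pairs are scored in a handful of array operations, with no Python-level loop over the rows once the matrices exist.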
Also check out this thread which is a similar question to yours: Efficient way for Computing the Similarity of Multiple Documents using Spacy
I'm not sure what exactly your main goal is in your code but I am pretty sure that calculating each pairwise similarity is not required or at least not the best way to go ahead and reach that goal, so please share more about the context you need this method in.

After going through the answers and this other related thread Efficient way for Computing the Similarity of Multiple Documents using Spacy, I managed to get a significant speed-up.
I am now using the following code:
    import spacy

    # Exclude pipeline components that are not needed for similarity
    nlp = spacy.load("en_core_web_md", exclude=["tagger", "parser", "senter", "attribute_ruler", "lemmatizer", "ner"])
    processed_docs_1 = nlp.pipe(texts_1)
    processed_docs_2 = nlp.pipe(texts_2)
    for _ in range(len(texts_1)):
        doc_1 = next(processed_docs_1)
        doc_2 = next(processed_docs_2)
        s = doc_1.similarity(doc_2)
where texts_1 and texts_2 are of the same length consisting of the pairs to compare (e.g. texts_1[i] with texts_2[i]).
Adding "exclude" to spacy.load resulted in a ~2x speed-up.
Using nlp.pipe as opposed to calling nlp inside the loop resulted in a ~10x speed-up. So combined, I obtain a ~20x speed-up.

Efficient way for Computing the Similarity of Multiple Documents using Spacy

I have around 10k docs (mostly 1-2 sentences) and, for each of these docs, want to find the ten most similar docs in a collection of 60k docs. For this, I want to use the spaCy library. Due to the large number of docs this needs to be efficient, so my first idea was to compute the document vector (https://spacy.io/api/doc#vector) for each of the 60k docs as well as the 10k docs and save them in two matrices. These two matrices can be multiplied to get the dot product, which can be interpreted as the similarity.
Now, I have basically two questions:
1. Is this actually the most efficient way, or is there a clever trick that can speed up this process?
2. If there is no other clever way, is there at least a way to speed up computing the matrices of document vectors? Currently I am using a for loop, which obviously is not exactly fast:
    import numpy as np
    import spacy

    nlp = spacy.load('en_core_web_lg')
    doc_matrix = np.zeros((len(train_list), 300))
    for i in range(len(train_list)):
        doc = nlp(train_list[i])  # train_list contains the individual documents
        doc_matrix[i] = doc.vector
Is there for example a way to parallelize this?

Don't do a big matrix operation, instead put your document vectors in an approximate nearest neighbors store (annoy is easy to use) and query the nearest items for each vector.
Doing a big matrix operation will do n * n comparisons, but using approximate nearest neighbors techniques will partition the space to perform many fewer calculations. That's much more important for the overall runtime than anything you do with spaCy.
That said, also check the spaCy speed FAQ.

I have personally never worked with sentence similarity/vectors in spaCy directly, so I can't answer your first question for sure; there might be some clever way to do this that is more native to spaCy.
For generally speeding up the spaCy processing:
Disable components you don't need, such as named entity recognition, part-of-speech tagging, etc.
Use processed_docs = nlp.pipe(train_list) instead of calling nlp inside the loop. Then access the results with for doc in processed_docs: or doc = next(processed_docs) inside the loop. You can tune the pipe() parameters to speed it up even more, depending on your hardware; see the documentation.
For your actual "find the n most similar" problem:
This problem is not NLP- or spaCy-specific but a general one. There are a lot of sources online on how to optimize this for numpy vectors; you are basically looking for the n nearest data points within a large dataset (10,000) of high-dimensional (300) data. Check out this thread for some general ideas, or this thread for how to perform this kind of search (in this case a k-nearest-neighbours search) on numpy data.
Generally, you should also not forget that a large dataset (unless filtered) is going to contain documents/sentences that are duplicates or near-duplicates (differing only by a comma or so), so you might want to apply some filtering before performing the search.
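For the "find the n most similar" step, a brute-force numpy baseline might look like the sketch below. The function name, the random stand-in data, and the choice of np.argpartition are my assumptions; it also assumes k is no larger than the number of corpus rows and that no vector is all zeros.

```python
import numpy as np

def top_k_similar(queries, corpus, k=10):
    """Return (indices, scores) of the k most cosine-similar corpus rows per query row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = q @ c.T  # shape: (n_queries, n_corpus)
    # argpartition picks the k largest per row without sorting the whole row
    part = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    part_scores = np.take_along_axis(sims, part, axis=1)
    order = np.argsort(-part_scores, axis=1)  # sort only the k winners
    idx = np.take_along_axis(part, order, axis=1)
    return idx, np.take_along_axis(sims, idx, axis=1)

# Random stand-ins for 300-dimensional document vectors
rng = np.random.default_rng(0)
corpus = rng.normal(size=(600, 300))
queries = corpus[:5] + 0.01 * rng.normal(size=(5, 300))  # slightly perturbed copies
idx, scores = top_k_similar(queries, corpus, k=10)
```

At 10k x 60k the full similarity matrix holds 600 million floats, so in practice you would probably feed the queries through in batches, or switch to an approximate nearest-neighbour index.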

How to find text similarity within millions of entries?

Having used spaCy to find similarity across a few texts, I'm now trying to find similar texts among millions of entries (instantaneously).
I have an app with millions of texts and I'd like to present the user with similar texts if they ask to.
How do sites like Stack Overflow find similar questions so fast?
I can imagine 2 approaches:
1. Each time a text is inserted, it is compared against the entire DB and a link is created between similar questions (in an intermediate table with both foreign keys).
2. Each time a text is inserted, its vector is stored in a field associated with this text. Whenever a user asks for similar texts, it searches the DB for similar texts.
My doubt is with the second choice. Is storing the word vector enough to search quickly for similar texts?

Comparing all the texts every time a new request comes in is infeasible.
To be really fast on large datasets, I can recommend locality-sensitive hashing (LSH). It gives you entries that are similar with high probability, and it significantly reduces the complexity of your algorithm.
However, you have to train your algorithm once - that may take time - but after that it's very fast.
https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134
https://en.wikipedia.org/wiki/Locality-sensitive_hashing
Here is a tutorial that seems close to your application:
https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/
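To give a feel for how LSH narrows the search, here is a small sketch of one LSH family (random hyperplanes, which suits cosine similarity). The class name, the number of bits, and the random data are assumptions for illustration; the linked tutorial uses a different family and production systems typically combine several hash tables.

```python
import numpy as np
from collections import defaultdict

class RandomHyperplaneLSH:
    """Bucket vectors by the sign pattern of projections onto random hyperplanes."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))
        self.buckets = defaultdict(list)

    def _key(self, vec):
        # One bit per hyperplane: which side of the plane the vector falls on
        return tuple((self.planes @ vec > 0).astype(int))

    def add(self, item_id, vec):
        self.buckets[self._key(vec)].append(item_id)

    def candidates(self, vec):
        """Items in the same bucket; only these need an exact comparison."""
        return list(self.buckets.get(self._key(vec), []))

# Build the index once (the slow part), then look up per query (the fast part)
rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 50))
lsh = RandomHyperplaneLSH(dim=50)
for i, v in enumerate(vectors):
    lsh.add(i, v)
same_bucket = lsh.candidates(vectors[42])
```

Vectors pointing in similar directions tend to share a key, so a query is compared exactly only against its bucket rather than the whole collection. A single table can miss true neighbours that land just across a hyperplane, which is why multiple tables are commonly used.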

You want a function that can map quickly from a text, into a multi-dimensional space. Your collection of documents should be indexed with respect to that space such that you can quickly find the shortest-distance match between your text, and those in the space.
Algorithms exist that will speed up that indexing process, but it could be as simple as sub-indexing the space into shards or blocks on a less granular basis and narrowing down the search like that.
One simple way of defining such a space might be term frequency (TF) or term frequency-inverse document frequency (TF-IDF). Without a limit on your vocabulary size, these can suffer from space/accuracy issues; still, with a vocabulary of the most specific 100 words in a corpus, you should be able to get a reasonable indication of similarity that would scale to millions of results. It depends on your corpus.
There are plenty of alternative features you might consider - but all of them will resolve to having a reliable method of transforming your document into a geometric vector, which you can then interrogate for similarity.

Calculating similarity measure between millions of documents

I have millions of documents (close to 100 million); each document has fields such as skills, hobbies, certification, and education. I want to find the similarity between each pair of documents, along with a score.
Below is an example of data.
skills   hobbies        certification  education
Java     fishing        PMP            MS
Python   reading novel  SCM            BS
C#       video game     PMP            B.Tech.
C++      fishing        PMP            MS
So what I want is the similarity between the first row and all other rows, between the second row and all other rows, and so on. Every document should be compared against every other document to get the similarity scores.
The purpose is that I query my database to get people based on skills. In addition to that, I now want people who do not have the skills themselves but somewhat match the people who do. For example, if I wanted to get data for people who have Java skills, the first row would appear, and the last row would appear as well, since it is similar to the first row based on the similarity score.
Challenge: My primary challenge is to compute a similarity score for each document against every other document, as you can see from the pseudocode below. How can I do this faster? Is there a different way to do this, or is there another computational (hardware/algorithm) approach that would be faster?
    document = all_document_in_db
    for i in document:
        for j in document:
            if i != j:
                compute_similarity(i, j)

One way to speed this up would be to ensure you don't calculate similarity both ways. Your current pseudocode will compare i to j and j to i. Instead of iterating j over the whole document, iterate over document[i+1:], i.e. only entries after i. This will reduce your calls to compute_similarity by half.
The most suitable data structure for this kind of comparison would be an adjacency matrix. This will be an n * n matrix (n is the number of members in your data set), where matrix[i][j] is the similarity between members i and j. You can populate this matrix fully while still only half-iterating over j, by just simultaneously assigning matrix[i][j] and matrix[j][i] with one call to compute_similarity.
Beyond this, I can't think of any way to speed up this process; you will need to make at least n * (n - 1) / 2 calls to compute_similarity. Think of it like a handshake problem; if every member must be compared to ('shake hands with') every other member at least once, then the lower bound is n * (n - 1) / 2. But I welcome other input!
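A small sketch of the half-iteration with a symmetric matrix. The `compute_similarity` body here is a hypothetical stand-in (fraction of matching fields), and setting the diagonal to 1.0 is my own choice; swap in whatever measure you actually use.

```python
from itertools import combinations

def compute_similarity(a, b):
    """Hypothetical stand-in: fraction of fields on which two records agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def similarity_matrix(records):
    """Fill an n x n matrix while calling compute_similarity once per unordered pair."""
    n = len(records)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    calls = 0
    for i, j in combinations(range(n), 2):  # yields each pair with i < j exactly once
        score = compute_similarity(records[i], records[j])
        matrix[i][j] = matrix[j][i] = score  # assign both halves at once
        calls += 1
    return matrix, calls

records = [
    ("Java", "fishing", "PMP", "MS"),
    ("Python", "reading novel", "SCM", "BS"),
    ("C#", "video game", "PMP", "B.Tech."),
    ("C++", "fishing", "PMP", "MS"),
]
matrix, calls = similarity_matrix(records)
```

For the four example rows this makes 4 * 3 / 2 = 6 calls instead of 12. Note that the dense n * n matrix itself becomes the bottleneck long before 100 million rows.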

I think what you want is some sort of clustering algorithm. Think of each row of your data as a point in a multi-dimensional space. You then want to look for other points that are nearby. Not all the dimensions of your data will produce good clusters, so you want to analyze which dimensions are significant for generating clusters and reduce the complexity of looking for similar records by mapping the data to a lower dimension. scikit-learn has some good routines for dimensional analysis and clustering, as well as some of the best documentation for helping you decide which routines to apply to your data. For actually doing the analysis, I think you might do well to purchase cloud time with AWS or Google AppEngine. I believe both can give you access to Hadoop clusters with Anaconda (which includes scikit-learn) available on the nodes. Detailed instructions on either of these topics (clustering, cloud computing) are beyond a simple answer. When you get stuck, post another question.

With 100 million documents, you need n * (n - 1) / 2, i.e. roughly 5,000,000 billion (5 x 10^15), comparisons. No, you cannot do this in Python.
The most feasible solution (aside from using a supercomputer) is to calculate the similarity scores in C/C++.
1. Read the whole database and enumerate each skill, hobby, certification, and education. This operation takes linear time, assuming that your index look-ups are "smart" and take constant time.
2. Create a C/C++ struct with four numeric fields: skill, hobby, certification, and education.
3. Run a nested loop that subtracts each struct from all other structs fieldwise and uses bit-level arithmetic to assess the similarity.
4. Save the results into a file and make them available to the Python program, if necessary.
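The enumerate-then-compare steps can be sketched in Python with numpy to show the data layout; this is not the C/C++ implementation the answer recommends, just the same idea at toy scale. The function names are hypothetical, and counting equal fields is one simple way to read "fieldwise subtraction" (a zero difference means a match).

```python
import numpy as np

rows = [
    ("Java", "fishing", "PMP", "MS"),
    ("Python", "reading novel", "SCM", "BS"),
    ("C#", "video game", "PMP", "B.Tech."),
    ("C++", "fishing", "PMP", "MS"),
]

def encode_columns(rows):
    """Replace each distinct value in a column with a small integer code."""
    codes = np.zeros((len(rows), len(rows[0])), dtype=np.int32)
    for col in range(len(rows[0])):
        lookup = {}  # dict look-ups are constant time on average
        for r, row in enumerate(rows):
            codes[r, col] = lookup.setdefault(row[col], len(lookup))
    return codes

def matches_against_all(codes, i):
    """Subtract row i from every row fieldwise and count the zero differences."""
    return (codes - codes[i] == 0).sum(axis=1)

codes = encode_columns(rows)
matches = matches_against_all(codes, 0)  # how many fields each row shares with row 0
```

Each record becomes four small integers, which is what makes a compact struct and cheap integer comparisons possible in a lower-level language.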

Actually, I believe you need to compute a matrix representation of the documents and call compute_similarity only once. This will invoke a vectorized implementation of the algorithm on all pairs of rows of features in the X matrix (the first parameter, assuming scikit-learn). You'll be surprised by the performance. If calculating this in one call exceeds your RAM, you can try to process it in chunks.
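A sketch of the chunked variant using plain numpy rather than scikit-learn. The function name, chunk size, and random feature matrix are assumptions; it keeps only the best match per row so that the full n x n result never has to sit in memory.

```python
import numpy as np

def chunked_cosine_top1(X, chunk_size=256):
    """For each row of X, find the most similar other row, one block of rows at a time."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # assumes no all-zero rows
    best_idx = np.empty(len(X), dtype=np.int64)
    best_score = np.empty(len(X))
    for start in range(0, len(X), chunk_size):
        stop = min(start + chunk_size, len(X))
        sims = Xn[start:stop] @ Xn.T  # only a (chunk, n) block is held in memory
        # A row is trivially most similar to itself, so mask the self-matches
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        best_idx[start:stop] = sims.argmax(axis=1)
        best_score[start:stop] = sims.max(axis=1)
    return best_idx, best_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))  # stand-in feature matrix
X[1] = X[0] * 2.0                # rows 0 and 1 point in the same direction
best_idx, best_score = chunked_cosine_top1(X, chunk_size=300)
```

Chunking bounds the memory, not the work: the number of multiplications is still quadratic in the number of rows.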

Finding the most similar documents (nearest neighbours) from a set of documents

I have 80,000 documents that cover a vast number of topics. What I want to do is, for every article, provide links recommending other articles (something like the top 5 related articles) that are similar to the one a user is currently reading. If I don't have to, I'm not really interested in classifying the documents, just similarity or relatedness, and ideally I would like to output an 80,000 x 80,000 matrix of all the documents with the corresponding distance (or perhaps correlation? similarity?) to other documents in the set.
I'm currently using NLTK to process the contents of the document and get ngrams, but from there I'm not sure what approach I should take to calculate the similarity between documents.
I read about using tf-idf and cosine similarity, however because of the vast number of topics I'm expecting a very high number of unique tokens, so multiplying two very long vectors might be a bad way to go about it. Also 80,000 documents might call for a lot of multiplication between vectors. (Admittedly, it would only have to be done once though, so it's still an option).
Is there a better way to get the distance between documents without creating a huge vector of ngrams? Spearman correlation? Or would a more low-tech approach, like taking the top ngrams and finding other documents with the same ngrams in their top k ngrams, be more appropriate? I just feel like I must be going about the problem in the most brute-force way possible if I need to multiply possibly 10,000-element vectors together about 3.2 billion times (sum of the arithmetic series 79,999 + 79,998 + ... + 1).
Any advice for approaches or what to read up on would be greatly appreciated.

So for K=5 you basically want to return the K nearest neighbors of a particular document? In that case you should use the K-Nearest Neighbors algorithm. Scikit-learn has some good text importing and normalizing routines (tf-idf), and then it's pretty easy to implement KNN.
The heuristic is basically just creating normalized word-count vectors from all of the words in a document and then comparing the distances between the vectors. I would definitely try swapping in a few different distance metrics: Euclidean vs. Manhattan vs. cosine similarity, for instance. The vectors aren't really long, they just sit in a high-dimensional space. So you can fix the unique-words issue you wrote of by doing some dimensionality reduction through PCA or your favorite algorithm.
It's probably equally easy to do this in another package, but the documentation of scikit-learn is top notch and makes it easy to learn quickly and thoroughly.
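To illustrate why it is worth trying several metrics, here is a tiny brute-force nearest-neighbour sketch in plain numpy (not scikit-learn's KNN classes). The function name and the four 2-D points are made up so that the metrics visibly disagree.

```python
import numpy as np

def nearest_neighbors(X, query, k=5, metric="cosine"):
    """Indices of the k rows of X closest to query under the chosen metric."""
    if metric == "euclidean":
        dist = np.linalg.norm(X - query, axis=1)
    elif metric == "manhattan":
        dist = np.abs(X - query).sum(axis=1)
    elif metric == "cosine":
        cos = (X @ query) / (np.linalg.norm(X, axis=1) * np.linalg.norm(query))
        dist = 1.0 - cos
    else:
        raise ValueError(f"unknown metric: {metric}")
    return np.argsort(dist, kind="stable")[:k]  # stable sort keeps ties in index order

# Toy vectors: row 1 has the same direction as row 0 but ten times the length
X = np.array([
    [1.0, 0.0],
    [10.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
])
query = np.array([2.0, 0.0])
by_metric = {
    m: nearest_neighbors(X, query, k=2, metric=m).tolist()
    for m in ("euclidean", "manhattan", "cosine")
}
```

Euclidean and Manhattan distance treat the long vector in row 1 as far away, while cosine distance ignores length and ranks it alongside row 0. With word-count vectors, where document length varies a lot, that difference can matter.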

You should learn about hashing mechanisms that can be used to calculate similarity between documents.
Typical hash functions are designed to minimize collisions, mapping near-duplicates to very different hash keys. In cryptographic hash functions, if a single bit of the data is changed, the hash key changes to a completely different one.
The goal of similarity hashing is the opposite. Hash-based techniques for near-duplicate detection are designed so that very similar documents map to very similar hash keys, or even to the same key. The bitwise Hamming distance between keys is then a measure of similarity.
After calculating the hash keys, the keys can be sorted to speed up near-duplicate detection from O(n^2) to O(n log n). A threshold can be defined and tuned by analysing accuracy on training data.
SimHash, MinHash and locality-sensitive hashing are three hash-based methods. You can google them to get more information. There is a lot of research on this topic.
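As a rough illustration of similarity hashing, the sketch below computes a simplified 64-bit SimHash over whitespace tokens and compares fingerprints by Hamming distance. The tokenizer, the unweighted token counts, the use of blake2b as the per-token hash, and the example sentences are all assumptions; real implementations usually weight features (e.g. by tf-idf) and use shingles.

```python
import hashlib

N_BITS = 64

def simhash(text):
    """Compute a 64-bit SimHash fingerprint from whitespace-separated tokens."""
    weights = [0] * N_BITS
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(N_BITS):
            # Each token votes +1 or -1 on every bit position
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # The fingerprint keeps the majority vote for each bit
    return sum(1 << bit for bit in range(N_BITS) if weights[bit] > 0)

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river shore"
doc_c = "stock markets fell sharply after the central bank raised interest rates"

near = hamming_distance(simhash(doc_a), simhash(doc_b))
far = hamming_distance(simhash(doc_a), simhash(doc_c))
```

Documents that share most of their tokens typically end up with fingerprints that differ in only a few bits, whereas unrelated documents tend to differ in roughly half of them. A threshold on the Hamming distance then flags near-duplicates.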
