Calculating similarity measure between millions of documents

Calculating similarity measure between millions of documents - python

I have millions of documents(close to 100 million), each document has fields such as skills, hobbies, certification and education. I want to find similarity between each document along with a score.
Below is an example of data.
skills hobbies certification education
Java fishing PMP MS
Python reading novel SCM BS
C# video game PMP B.Tech.
C++ fishing PMP MS
so what i want is similarity between first row and all other rows, similarity between second row and all other rows and so on. So, every document should be compared against every other document. to get the similarity scores.
Purpose is that i query my database to get people based on skills. In addition to that, i now want people who even though do not have the skills, but are somewhat matching with the people with the specific skills. For example, if i wanted to get data for people who have JAVA skills, first row will appear and again, last row will appear as it is same with first row based on similarity score.
Challenge: My primary challenge is to compute some similarity score for each document against every other document as you can see from below pseudo code. How can i do this faster? Is there any different way to do this with this pseudo code or is there any other computational(hardware/algorithm) approach to do this faster?
document = all_document_in_db
For i in document:
for j in document:
if i != j :
compute_similarity(i,j)

One way to speed up would be to ensure you don't calculate similarity both ways. your current pseudocode will compare i to j and j to i. instead of iterating j over the whole document, iterate over document[i+1:], i.e. only entries after i. This will reduce your calls to compute_similarity by half.
The most suitable data structure for this kind of comparison would be an adjacency matrix. This will be an n * n matrix (n is the number of members in your data set), where matrix[i][j] is the similarity between members i and j. You can populate this matrix fully while still only half-iterating over j, by just simultaneously assigning matrix[i][j] and matrix[j][i] with one call to compute_similarity.
Beyond this, I can't think of any way to speed up this process; you will need to make at least n * (n - 1) / 2 calls to compute_similarity. Think of it like a handshake problem; if every member must be compared to ('shake hands with') every other member at least once, then the lower bound is n * (n - 1) / 2. But I welcome other input!

I think what you want is some sort of clustering algorithm. You think of each row of your data as giving a point in a multi-dimensional space. You then want to look for other 'points' that are nearby. Not all the dimensions of your data will produce good clusters so you want to analyze your data for which dimensions will be significant for generation of clusters and reduce the complexity of looking for similar records by mapping to a lower dimension of the data. scikit-learn has some good routines for dimensional analysis and clustering as well as some of the best documentation for helping you to decide which routines to apply to your data. For actually doing the analysis I think you might do well to purchase cloud time with AWS or Google AppEngine. I believe both can give you access to Hadoop clusters with Anaconda (which includes scikit-learn) available on the nodes. Detailed instructions on either of these topics (clustering, cloud computing) are beyond a simple answer. When you get stuck post another question.

With 100 mln document, you need 500,000 bln comparisons. No, you cannot do this in Python.
The most feasible solution (aside from using a supercomputer) is to calculate the similarity scores in C/C++.
Read the whole database and enumerate each skill, hobby, certification, and education. This operation takes a linear time, assuming that your index look-ups are "smart" and take constant time.
Create a C/C++ struct with four numeric fields: skill, hobby, certification, and education.
Run a nested loop that subtracts each struct from all other structs fieldwise and uses bit-level arithmetic to assess the similarity.
Save the results into a file and make them available to the Python program, if necessary.

Actually, I believe you need to compute a matrix representation of the documents and only call the compute_similarity once. This will invoke a vectorized implementation of the algo on all pairs of rows of features in the X matrix (the first parameter assuming sci-kit learn). You'll be surprised by the performance. If the attempt to calculate this in one call exceeds your RAM you can try to chunk.

Related

Efficient way for Computing the Similarity of Multiple Documents using Spacy

I have around 10k docs (mostly 1-2 sentences) and want for each of these docs find the ten most simliar docs of a collection of 60k docs. Therefore, I want to use the spacy library. Due to the large amount of docs this needs to be efficient, so my first idea was to compute both for each of the 60k docs as well as the 10k docs the document vector (https://spacy.io/api/doc#vector) and save them in two matrices. This two matrices can be multiplied to get the dot product, which can be interpreted as the similarity.
Now, I have basically two questions:
Is this actually the most efficient way or is there a clever trick that can speed up this process
If there is no other clever way, I was wondering whether there is at least a clever way to speed up the process of computing the matrices of document vectors. Currently I am using a for loop, which obviously is not exactly fast:
import spacy
nlp = spacy.load('en_core_web_lg')
doc_matrix = np.zeros((len(train_list), 300))
for i in range(len(train_list)):
doc = nlp(train_list[i]) #the train list contains the single documents
doc_matrix[i] = doc.vector
Is there for example a way to parallelize this?

Don't do a big matrix operation, instead put your document vectors in an approximate nearest neighbors store (annoy is easy to use) and query the nearest items for each vector.
Doing a big matrix operation will do n * n comparisons, but using approximate nearest neighbors techniques will partition the space to perform many fewer calculations. That's much more important for the overall runtime than anything you do with spaCy.
That said, also check the spaCy speed FAQ.

I personally never worked with sentence similarity/vectors in SpaCy directly, so I can't tell you for sure about your first question, there might be some clever way to do this which is more native to SpaCy/the usual way to do it.
For generally speeding up the SpaCy processing:
Disable components you don't need such as Named Entity Recognition, Part of Speech Tagging etc.
Use processed_docs = nlp.pipe(train_list) instead of calling nlp inside the loop. Then access with for doc in processed_docs: or doc = next(processed_docs) inside the loop. You can tune the pipe() parameters to speed it up even more, depending on your hardware, see the documentation.
For your actual "find the n most similar" problem:
This problem is not NLP- or SpaCy-specific but a general problem. There are a lot of sources on how to optimize this for numpy vectors online, you are basically looking for the n nearest datapoints within a large dataset (10000) of high dimensional (300) data. Check out this thread for some general ideas or this thread to for how to perform this kind of search (in this case K-nearest neighbours search) on numpy data.
Generally you should also not forget that in a large dataset (unless filtered) there are going to be documents/sentences which are duplicates or nearly duplicates (only differ by comma or so), so you might want to apply some filtering before performing the search.

How do you solve long tail problem in a recommendation system?

I want to make a movie recommendation system using the binary ratings that is whether a person has seen the movie or not! I am using various cosine similarity techniques and all but the issue is the Long Tail
in Recommendation System. I am not able to find any concrete solution which uses just viewed or not (i.e. either 0 or 1) and not the ratings as such for the recommendation? What other popular algorithms can be used for the same. I need to remove the long tail issue,
I have used Adaptive Clustering but it needs many Derived Variables and those are not present here.
Used other ways like Total Clustering but no use.
Used Popularity Sensitive Clustering but same issue.
Been stuck here in this long tail issue but not getting even a good implementation for my work or a research paper that helps but nothing.
Everyone is using either ratings or the user data but my work doesn't have any user info and neither is it having any ratings just the binary values.

The Long Tail issue in recommendation systems basically is about how to give users recommendation of items that do not have a lot of interactions(ratings/likes) etc. As similarity algorithms like cosine similarity and clustering algorithms fails in recommending them. You need to look into diversity increasing algorithms.
What I mean is rather than calculating similarity try calculating dissimilarity.
Here R is recommendation list, d(i, j) is dissimilarity.
You can use surprise to generate R here using matrix factorization algorithms.
Also, when you generate a user vs. item matrix where matrix[user_i][item_j] denote rating you can convert it to 1 to show rating and 0 otherwise and it will still work. Also, these binary ratings generally are call interaction the user had with the item.

Computing text similarity against many documents

I'm trying to compute the text similarity of a search term, A, like "How to make chickens" against a collection of other search terms. To compute similarity I'm using the cosine distance and TF-IDF to transform A into a vector. I'd like to compare the similarity of A against all documents at once.
Currently, my approach involves computing the cosine similarity for A against every other document one at a time, iteratively. I have 100 documents I'm comparing against. If the result of cos_sim(A, X) > 0.8 then I break and say "cool, this is similar".
However, I feel like this might not be a true representation of the overall similarity. Is there a way to pre-compute a vector(s) for my 100 documents at runtime, and every time I see a new search query A, I can compare against this pre-defined vector/document?
I believe I can achieve this by simply combining all documents into one... feels rough though. What are the pros and & cons, and possible solutions? Extra points for efficiency!

This problem is essentially the traditional search problem: Have you tried putting your documents into something like Lucene (Java) or Whoosh (python)? I think they have a cosine-similarity model (but even if they don't, the default may be better).
The general trick all search engines use is that in general, documents are sparse. This means to compute the similarity (e.g., cosine similarity) it only matters what the lengths of the documents are (known way ahead of time) and the terms that they both contain; you can organize a data structure like a back-of-the-book index, called an inverted index that can quickly tell you which documents will get at least a non-zero score.
With only 100 documents, a search engine is probably overkill; you want to pre-compute the TF-IDF vectors and keep them in a numpy matrix. You can then use numpy operations to compute the dot product all at once for all the documents -- it will output a 1x100 vector of the numerators you need. The denominators can similarly be precomputed. A numpy.max(numpy.dot(query, docs)/denom) will then probably be fast enough.
You should profile your code, but I would bet your vector extraction is the slow part; but you should only have to do that once for all queries.
If you had thousands or millions of documents to compare against, you could look into SciKit learn's K-nearest-neighbor structures (e.g., Ball Tree or KDTree, or things like Facebook's FAISS library.

Finding the most similar documents (nearest neighbours) from a set of documents

I have 80,000 documents that are about a very vast number of topics. What I want to do is for every article, provide links to recommend other articles (something like top 5 related articles) that are similar to the one that a user is currently reading. If I don't have to, I'm not really interested in classifying the documents, just similarity or relatedness, and ideally I would like to output a 80,000 x 80,000 matrix of all the documents with the corresponding distance (or perhaps correlation? similarity?) to other documents in the set.
I'm currently using NLTK to process the contents of the document and get ngrams, but from there I'm not sure what approach I should take to calculate the similarity between documents.
I read about using tf-idf and cosine similarity, however because of the vast number of topics I'm expecting a very high number of unique tokens, so multiplying two very long vectors might be a bad way to go about it. Also 80,000 documents might call for a lot of multiplication between vectors. (Admittedly, it would only have to be done once though, so it's still an option).
Is there a better way to get the distance between documents without creating a huge vector of ngrams? Spearman Correlation? or would a more low-tech approach like taking the top ngrams and finding other documents with the same ngrams in the top k-ngrams be more appropriate? I just feel like surely I must be going about the problem in the most brute force way possible if I need to multiply possibly 10,000 element vectors together 320 million times (sum of the arithmetic series 79,999 + 79,998... to 1).
Any advice for approaches or what to read up on would be greatly appreciated.

So for K=5 you basically want to return the K-Nearest Neighbors to a particular document? In that case you should use the K-Nearest Neighbors algorithm. Scikit-Learn has some good text importing and normalizing routines (tfidf) and then its pretty easy to implement KNN.
The heuristics are basically just creating normalized word count vectors from all of the words in a document and then comparing the distance between the vectors. I would definitely swap out a few different distance metrics: Euclidean vs. Manhattan vs. Cosine Similarity for instance. The vectors aren't really long, they just sit in a high dimensional space. So you can fix the unique words issue you wrote of by just doing some dimensionality reduction through PCA or your favorite algo.
Its probably equally easy to do this in another package, but the documentation of scikit learn is top notch and makes it easy to learn quickly and thoroughly.

You should learn about hashing mechanisms that can be used to calculate similarity between documents.
Typical hash functions are designed to minimize collision mapping near duplicates to very different hash keys. In cryptographic hash functions, if the data is changed with one bit, the hash key will be changed to a completely different one.
The goal of similarity hashing is to create a similarity hash function. Hash based techniques for near duplicate detection are designed for the opposite intent of cryptographic hash algorithms. Very similar documents map to very similar hash keys, or even to the same key. The difference between bitwise hamming distance of keys is a measure of similarity.
After calculating the hash keys, keys can be sorted to increase the speed of near duplicate detection from O(n2) to O(nlog(n)). A threshold can be defined and tuned by analysing accuracy of training data.
Simhash, Minhash and Local sensitive hashing are three implementations of hash based methods. You can google and get more information about these. There are a lot of research papers related to this topic...

group detection in large data sets python

I am a newbie in python and have been trying my hands on different problems which introduce me to different modules and functionalities (I find it as a good way of learning).
I have googled around a lot but haven't found anything close to a solution to the problem.
I have a large data set of facebook posts from various groups on facebooks that use it as a medium to mass send the knowledge.
I want to make groups out of these posts which are content-wise same.
For example, one of the posts is "xyz.com is selling free domains. Go register at xyz.com"
and another is "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost."
These are similar as they both ask to go the group's website and register.
P.S: Just a clarification, if any one of the links would have been abc.com, they wouldn't have been similar.
Priority is to the source and then to the action (action being registering here).
Is there a simple way to do it in python? (a module maybe?)
I know it requires some sort of clustering algorithm ( correct me if I am wrong), my question is can python make this job easier for me somehow? some module or anything?
Any help is much appreciated!

Assuming you have a function called geturls that takes a string and returns a list of urls contained within, I would do it like this:
from collections import defaultdict
groups = defaultdict(list):
for post in facebook_posts:
for url in geturls(post):
groups[url].append(post)

That greatly depends on your definition of being "content-wise same". A straight forward approach is to use a so-called Term Frequency - Inverse Document Frequency (TFIDF) model.
Simply put, make a long list of all words in all your posts, filter out stop-words (articles, determiners etc.) and for each document (=post) count how often each term occurs, and multiplying that by the importance of the team (which is the inverse document frequency, calculated by the log of the ratio of documents in which this term occurs). This way, words which are very rare will be more important than common words.
You end up with a huge table in which every document (still, we're talking about group posts here) is represented by a (very sparse) vector of terms. Now you have a metric for comparing documents. As your documents are very short, only a few terms will be significantly high, so similar documents might be the ones where the same term achieved the highest score (ie. the highest component of the document vectors is the same), or maybe the euclidean distance between the three highest values is below some parameter. That sounds very complicated, but (of course) there's a module for that.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.