I have a corpus that contains several documents, for example 10 documents. The idea is to compute the similarity between them and combine the most similar ones into one document, so the result may be 4 documents. What I have done so far is to iterate over the documents, find the two most similar ones, combine them into one document, and repeat until a threshold is reached. I used Word2vec vectors, taking the mean vector for the whole document. The problem is that as I proceed with the iterations, the longer a document gets, the more similar it appears, even if it is not really that similar, simply because it contains more words. Any idea on how to approach this problem?
I used the Google pretrained Word2vec model. The reason: the corpus is not big enough to train a model.
Note: I do not want to use topic modeling, for reasons specific to this project. Also, the documents are really short; more than half of them may be a single sentence.
I really appreciate your suggestions.
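For reference, here is a minimal sketch of the mean-vector approach described in the question (averaging pretrained Word2vec vectors per document and comparing with cosine similarity), assuming Gensim and a local copy of the Google News vectors; the file name and example sentences are placeholders.

import numpy as np
from gensim.models import KeyedVectors

# Pretrained Google News vectors (path is a placeholder).
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def doc_vector(tokens):
    # Mean of the word vectors for all in-vocabulary tokens.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = doc_vector("the weather is nice today".split())
doc_b = doc_vector("it is a lovely sunny day".split())
print(cosine(doc_a, doc_b))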
I've been experimenting with NLP and am using the Doc2Vec model.
My objective is a forum "suggested question" feature. For example, if a user types a question, its vector will be compared to those of questions already asked. So far this has worked OK in the sense of comparing one question to another asked question.
However, I would like to extend this to comparing the body of the question. For example, just like on Stack Overflow, I'm writing the description to my question.
I understand that Doc2Vec represents sentences through paragraph ids. So for the question example I spoke about first, each sentence would be a unique paragraph id. However, with paragraphs, i.e. the body of the question, sentences would share the same id as the other sentences that are part of the same paragraph.
para = 'This is a sentence. This is another sentence'
[['This','is','a','sentence',tag=[1]], ['This','is','another','sentence',tag=[1]]]
I'm wondering how to go about doing this. How can I input a corpus like so:
['It is a nice day today. I wish I was outside in the sun. But I need to work.']
and compare that to another paragraph like this:
['It is a lovely day today. The sun is shining outside today. However, I am working.']
In which case I would expect a very close similarity between the two. Does similarity get calculated sentence to sentence, rather than paragraph to paragraph? i.e.
cosine_sim(['It is a nice day today'], ['It is a lovely day today.'])
and do this for the other sentences and average out the similarity scores?
Thanks.
EDIT
What I am confused about is this: using the above sentences, say the vectors are like so
sent1 = [0.23,0.1,0.33...n]
sent2 = [0.78,0.2,-0.6...n]
sent3 = [0.55,-0.5,0.9...n]
# Average out these vectors
para = [0.5,0.2,0.3...n]
and using this vector compare to another paragraph using the same process.
I'll presume you're talking about the Doc2Vec model in the Python Gensim library, based on the word2vec-like 'Paragraph Vector' algorithm. (There are many alternate ways to turn a text into a vector, and sometimes other approaches, including the very simple one of averaging word-vectors together, also get called 'Doc2Vec'.)
Doc2Vec has no internal idea of sentences or paragraphs. It just considers texts: lists of word tokens. So you decide what-sized chunks of text to provide, and to associate with tag keys: multiword fragments, sentences, paragraphs, sections, chapters, articles, books, whatever.
Every tag you provide during initial bulk training will have an associated vector trained up, and stored in the model, based on the lists-of-words that you provided alongside it. So, you can retrieve those vectors from training via:
d2v_model.dv[tag]
You can also use that trained, frozen model to infer new vectors for new lists-of-words:
d2v_model.infer_vector(list_of_words)
(Note: these words should be preprocessed/tokenized the same way as those during training, and any words not known to the model from training will be silently ignored.)
And, once you have vectors for two different texts, from whatever method, you can compare them via cosine-similarity.
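For concreteness, here is a minimal end-to-end sketch of the workflow described above (train on TaggedDocument texts, look up trained vectors by tag, infer vectors for new texts, compare with cosine similarity), assuming Gensim 4.x; the toy texts and parameters are placeholders.

from numpy import dot
from numpy.linalg import norm
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

raw_texts = [
    "It is a nice day today. I wish I was outside in the sun. But I need to work.",
    "It is a lovely day today. The sun is shining outside today. However, I am working.",
]

# One TaggedDocument per text; each tag identifies a whole paragraph-sized chunk.
corpus = [TaggedDocument(words=simple_preprocess(t), tags=[i]) for i, t in enumerate(raw_texts)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

trained_vec = model.dv[0]  # vector learned during training for tag 0

# Infer a vector for a new text, tokenized the same way as during training.
new_vec = model.infer_vector(simple_preprocess("It is a sunny day and I am at work."))

similarity = dot(trained_vec, new_vec) / (norm(trained_vec) * norm(new_vec))
print(similarity)

(A corpus of two tiny texts is far too small for meaningful vectors; the sketch only illustrates the API calls.)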
For creating your doc-vectors, you might want to run the question & body together into one text. (If the question is more important, you could even consider training on a pseudotext that repeats the question more than once, for example both before and after the body.) Or you might want to treat them separately, so that some downstream process can weight question->question similarities differently than body->body or question->body. What's best for your data & goals usually has to be determined via experimentation.
I am working on a problem where I have text data with around 10,000 documents. I have created an app where, if a user enters some random comment, it should display all the similar comments/documents present in the training data.
Exactly like on Stack Overflow: if you ask a question, it shows all related questions asked earlier.
So if anyone has any suggestions on how to do it, please answer.
Second, I am trying the LDA (Latent Dirichlet Allocation) algorithm, where I can get the topic to which my new document belongs, but how will I get the similar documents from the training data? Also, how should I choose num_topics in LDA?
If anyone has any suggestions of algorithms other than LDA , please tell me.
You can try the following:
Doc2vec - this is an extension of the extremely popular word2vec algorithm, which maps words to an N-dimensional vector space such that words that occur in close proximity in your documents will also occur in close proximity in the vector space. You can use pre-trained word embeddings; learn more about word2vec here. Doc2vec extends this to map each document to a vector of dimension N. After this, you can use any distance measure to find the documents most similar to an input document.
Word Mover's Distance - this is directly suited to your purpose and also uses word embeddings. I have used it in one of my personal projects and achieved really good results. Find out more about it here.
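As a rough illustration of Word Mover's Distance, here is a minimal sketch using Gensim's wmdistance on pretrained GloVe vectors fetched via gensim.downloader; note that Gensim's WMD needs an extra optimal-transport package installed (pyemd or POT, depending on the Gensim version), and the example sentences are placeholders.

import gensim.downloader as api
from gensim.utils import simple_preprocess

# Any pretrained KeyedVectors will do; this is a small GloVe model for illustration.
word_vectors = api.load("glove-wiki-gigaword-50")

doc1 = simple_preprocess("Obama speaks to the media in Illinois")
doc2 = simple_preprocess("The president greets the press in Chicago")

# Lower distance means more similar documents.
print(word_vectors.wmdistance(doc1, doc2))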
Also, make sure you apply appropriate text cleaning before applying the algorithms: steps like case normalization, stopword removal, and punctuation removal. What is needed really depends on your dataset. Find out more here.
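For example, a very small cleaning pipeline (case normalization, punctuation removal, stopword removal) might look like the sketch below, using NLTK's English stopword list; this is just one possible set of steps, not a prescription.

import string
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once

STOPWORDS = set(stopwords.words("english"))

def clean(text):
    text = text.lower()                                               # case normalization
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    return [tok for tok in text.split() if tok not in STOPWORDS]      # drop stopwords

print(clean("This is a Sample Comment, with SOME punctuation!"))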
I hope this was helpful...
Assume you have a lot of text sentences which may (or may not) have similarities. Now you want to cluster the similar sentences and find the centroid of each cluster. Which method is the preferred way of doing this kind of clustering? K-means with TF-IDF sounds promising. Nevertheless, are there more sophisticated, or simply better, algorithms? The data is tokenized and in a one-hot encoded format.
Basically, you can cluster texts using different techniques. As you pointed out, K-means with TF-IDF is one way to do this. Unfortunately, TF-IDF alone won't be able to "detect" semantics or to project semantically similar texts near one another in the space. However, instead of TF-IDF you can use word embeddings, such as word2vec or GloVe; there is a lot of information on the net about them, just google it. Have you ever heard of topic models? Latent Dirichlet allocation (LDA) is a topic model that treats each document as a mixture of a small number of topics, with each word's presence attributable to one of the document's topics (see the Wikipedia link). So, basically, using a topic model you can also do some kind of grouping and assign similar texts (with similar topics) to groups. I recommend reading about topic models, since they are quite common for such text-clustering problems.
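For reference, here is a minimal sketch of the K-means-with-TF-IDF baseline mentioned above, using scikit-learn; the sentences and number of clusters are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = [
    "The cat sat on the mat.",
    "A cat was sitting on a rug.",
    "Stock prices fell sharply today.",
    "The market dropped again this morning.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assignment for each sentence
print(kmeans.cluster_centers_)  # centroids in TF-IDF space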
I hope my answer was helpful.
In my view, you can use LDA (latent Dirichlet allocation). It is more flexible than other clustering techniques because its alpha and beta priors can adjust the contribution of each topic in a document and of each word in a topic. It can help if the documents are not of similar length or quality.
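Here is a minimal Gensim LDA sketch of the alpha/beta point above (alpha="auto" and eta="auto" let Gensim learn asymmetric priors from the data); the toy corpus and num_topics are placeholders.

from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["cat", "dog", "pet", "animal"],
    ["stock", "market", "price", "trade"],
    ["dog", "pet", "food"],
]

dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=2,
               alpha="auto", eta="auto", passes=10, random_state=0)

# Topic mixture of a new document; documents with similar mixtures can be grouped.
new_doc = dictionary.doc2bow(["cat", "food", "pet"])
print(lda[new_doc])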
Having used spaCy to find similarity across a few texts, I'm now trying to find similar texts among millions of entries (instantaneously).
I have an app with millions of texts and I'd like to present the user with similar texts if they ask to.
How do sites like Stack Overflow find similar questions so fast?
I can imagine 2 approaches:
Each time a text is inserted, it is compared against the entire DB and a link is created between similar texts (in an intermediate table with both foreign keys).
Each time a text is inserted, its vector is stored in a field associated with that text. Whenever a user asks for similar texts, the app searches the DB for similar vectors.
My doubt is with the second option: is storing the word vector enough to search quickly for similar texts?
Comparing all the texts every time a new request comes in is infeasible.
To be really fast on large datasets I can recommend locality-sensitive hashing (LSH). It gives you entries that are similar with high probability, and it significantly reduces the complexity of your algorithm.
However, you have to train your algorithm once - that may take time - but after that it's very fast.
https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134
https://en.wikipedia.org/wiki/Locality-sensitive_hashing
Here is a tutorial that seems close to your application:
https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/
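To make the idea concrete, here is a minimal LSH sketch using the datasketch library (my choice for illustration, not one named in the links above): existing texts are MinHashed and indexed once, and a query only touches candidate buckets instead of the whole corpus.

from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

corpus = {
    "q1": "How do I sort a list in Python?",
    "q2": "Sorting a Python list, what is the best way?",
    "q3": "How to connect to a PostgreSQL database from Java?",
}

# Index all existing texts once.
lsh = MinHashLSH(threshold=0.3, num_perm=128)
for key, text in corpus.items():
    lsh.insert(key, minhash(text))

# At query time, only candidate buckets are checked, not every stored text.
query = "What is the fastest way to sort a list in Python?"
print(lsh.query(minhash(query)))  # e.g. ['q1', 'q2']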
You want a function that can map quickly from a text into a multi-dimensional space. Your collection of documents should be indexed with respect to that space, such that you can quickly find the shortest-distance match between your text and those in the space.
Algorithms exist that will speed up that indexing process, but it could be as simple as sub-dividing the space into shards or blocks on a less granular basis and narrowing down the search that way.
One simple way of defining such a space might be via term frequency (TF) or term frequency-inverse document frequency (TF-IDF). Without a limit on your vocabulary size these can suffer from space/accuracy issues, but with a vocabulary of, say, the 100 most distinctive words in a corpus, you should still be able to get a reasonable indication of similarity that would scale to millions of results. It depends on your corpus.
There are plenty of alternative features you might consider - but all of them will resolve to having a reliable method of transforming your document into a geometric vector, which you can then interrogate for similarity.
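As one concrete instance of this map-then-index idea, here is a minimal sketch with a capped TF-IDF vocabulary and a nearest-neighbour index in scikit-learn (my choice of tooling, not something prescribed by the answer above); the documents are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

documents = [
    "How do I parse JSON in Python?",
    "Parsing JSON strings with Python's json module",
    "Best way to center a div in CSS",
]

vectorizer = TfidfVectorizer(max_features=100)  # cap the vocabulary size
doc_vectors = vectorizer.fit_transform(documents)

index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(doc_vectors)

query_vec = vectorizer.transform(["How can I parse a JSON string in Python?"])
distances, neighbours = index.kneighbors(query_vec)
print(neighbours)  # indices of the most similar stored documents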
I am working on a document similarity problem. For each document, I retrieve the vectors for each of its words (from a pre-trained word embedding model) and average them to get the document vector. I end up having a dictionary (say, my_dict) that maps each document in my collection to its vector.
I want to feed this dictionary to gensim and for each document, get other documents in 'my_dict' that are closer to it. How could I do that?
You might want to consider rephrasing your question (from the title, you are looking for word similarity; from the description I gather you want document similarity) and adding a little more detail in the description. Without more detailed info about what you want and what you have tried, it is difficult to help you achieve it, because you could want to do a whole bunch of different things. That being said, I think I can help you out generally, even without knowing what you want gensim to do. gensim is quite powerful and offers lots of different functionality.
Assuming your dictionary is already in gensim format, you can load it like this:
from gensim import corpora
dictionary = corpora.Dictionary.load('my_dict.dict')
There - now you can use it with gensim, and run analyses and models to your heart's desire. For similarities between words you can play around with ready-made functions such as model.wv.most_similar('word_one') or model.wv.similarity('word_one', 'word_two').
For document similarity with a trained LDA model, see this stackoverflow question.
For a more detailed explanation, see this gensim tutorial which uses cosine similarity as a measure of similarity between documents.
gensim has a bunch of premade functionality that does not require LDA, for example gensim.similarities.MatrixSimilarity from similarities.docsim; I would recommend looking at the documentation and examples.
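For instance, a minimal MatrixSimilarity sketch over a TF-IDF corpus (the toy texts are placeholders):

from gensim import corpora, models, similarities

texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "minors", "trees"],
]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

tfidf = models.TfidfModel(bow_corpus)
index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))

# Cosine similarity of a query document against every document in the corpus.
query_bow = dictionary.doc2bow(["human", "computer", "survey"])
print(list(index[tfidf[query_bow]]))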
Also, in order to avoid a bunch of pitfalls: is there a specific reason to average the vectors yourself (or even to average them at all)? You do not need to do this (gensim has a few more sophisticated methods that achieve a mapping of documents to vectors for you, like models.doc2vec), and by averaging you might lose valuable information.