I have a sample of ~60,000 documents. We've hand coded 700 of them as having a certain type of content. Now we'd like to find the "most similar" documents to the 700 we already hand-coded. We're using gensim doc2vec and I can't quite figure out the best way to do this.
Here's what my code looks like:
cores = multiprocessing.cpu_count()
model = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0,
                epochs=10, workers=cores, dbow_words=1, train_lbls=False)
all_docs = load_all_files() # this function returns a named tuple
random.shuffle(all_docs)
print("Docs loaded!")
model.build_vocab(all_docs)
model.train(all_docs, total_examples=model.corpus_count, epochs=5)
I can't figure out the right way to go forward. Is this something that doc2vec can do? In the end, I'd like to have a ranked list of the 60,000 documents, where the first one is the "most similar" document.
Thanks for any help you might have! I've spent a lot of time reading the gensim help documents and the various tutorials floating around and haven't been able to figure it out.
EDIT: I can use this code to get the documents most similar to a short sentence:
token = "words associated with my research questions".split()
new_vector = model.infer_vector(token)
sims = model.docvecs.most_similar([new_vector])
for x in sims:
    print(' '.join(all_docs[x[0]][0]))
If there's a way to modify this to instead get the documents most similar to the 700 coded documents, I'd love to learn how to do it!
Your general approach is reasonable. A few notes about your setup:
you'd have to specify epochs=10 in your train() call to truly get 10 training passes – and 10 or more is most common in published work
sample-controlled downsampling helps speed training and often improves vector quality as well, and the value can become more aggressive (smaller) with larger datasets
train_lbls is not a parameter to Doc2Vec in any recent gensim version
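For reference, a sketch of a setup consistent with these notes (parameter values are illustrative rather than prescriptive; all_docs is the corpus from your code above):
import multiprocessing
from gensim.models.doc2vec import Doc2Vec

cores = multiprocessing.cpu_count()
model = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2,
                sample=1e-5,  # a non-zero value enables frequent-word downsampling
                epochs=10, workers=cores, dbow_words=1)  # no train_lbls parameter

model.build_vocab(all_docs)
model.train(all_docs, total_examples=model.corpus_count,
            epochs=model.epochs)  # reuse the epochs set at construction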
There are several possible ways to interpret and pursue your goal of "find the 'most similar' documents to the 700 we already hand-coded". For example, for a candidate document, how should its similarity to the set-of-700 be defined - as a similarity to one summary 'centroid' vector for the full set? Or as its similarity to any one of the documents?
There are a couple ways you could obtain a single summary vector for the set:
average their 700 vectors together
combine all their words into one synthetic composite document, and infer_vector() on that document. (But note: texts fed to gensim's optimized word2vec/doc2vec routines face an internal implementation limit of 10,000 tokens – excess words are silently ignored.)
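For example, a minimal sketch of both options, assuming ref_docs holds the 700 training tags and words_by_tag is a hypothetical helper mapping each tag to its token list:
import numpy as np

# Option 1: average the 700 trained document vectors into one centroid
ref_vectors = np.array([model.docvecs[tag] for tag in ref_docs])
centroid = ref_vectors.mean(axis=0)

# Option 2: infer a vector for one synthetic composite document
# (only the first 10,000 tokens would be used, per the limit noted above)
composite_words = [w for tag in ref_docs for w in words_by_tag[tag]]
composite_vector = model.infer_vector(composite_words[:10000])

sims = model.docvecs.most_similar([centroid], topn=10)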
In fact, the most_similar() method can take a list of multiple vectors as its 'positive' target, and will automatically average them together before returning its results. So if, say, the 700 document IDs (tags used during training) are in the list ref_docs, you could try...
sims = model.docvecs.most_similar(positive=ref_docs, topn=0)
...and get back a ranked list of all other in-model documents, by their similarity to the average of all those positive examples.
However, the alternate interpretation, that a document's similarity to the reference-set is its highest similarity to any one document inside the set, might be better for your purpose. This could especially be the case if the reference set itself is varied over many themes – and thus not well-summarized by a single average vector.
You'd have to compute these similarities with your own loops. For example, roughly:
sim_to_ref_set = {}
for doc_id in all_doc_ids:
    sim_to_ref_set[doc_id] = max([model.docvecs.similarity(doc_id, ref_id) for ref_id in ref_docs])
sims_ranked = sorted(sim_to_ref_set.items(), key=lambda it:it[1], reverse=True)
The top items in sims_ranked would then be those most-similar to any item in the reference set. (Assuming the reference-set ids are also in all_doc_ids, the 1st 700 results will be the chosen docs again, all with a self-similarity of 1.0.)
n_similarity looks like the function you want, but it seems to only work with samples in the training set.
Since you have only 700 documents to crosscheck with, using sklearn shouldn't pose performance issues. Simply get the vectors of your 700 documents, use sklearn.metrics.pairwise.cosine_similarity, and then find the closest matches (e.g. using np.argmax). Some untested code to illustrate that:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
reference_vectors = ... # your vectors to the 700 documents
new_vector = ... # inferred as per your last example
similarity_matrix = cosine_similarity([new_vector], reference_vectors)
most_similar_indices = np.argmax(similarity_matrix, axis=-1)
That can also be modified to implement a method like n_similarity for a number of unseen documents.
I think you can do what you want with TaggedDocument. The basic use case is to just add a unique tag (document id) for every document, but here you will want to add a special tag to all 700 of your hand-selected documents. Call it whatever you want, in this case I call it TARGET. Add that tag to only your 700 hand-tagged documents, omit it for the other 59,300.
TaggedDocument(words=gensim.utils.simple_preprocess(document), tags=['TARGET', document_id])
Now, train your Doc2Vec.
Next, you can use model.docvecs.similarity to score the similarity between your unlabeled documents and the custom tag.
model.docvecs.similarity(document_id,'TARGET')
And then just sort that. n_similarity and most_similar I don't think are going to be appropriate for what you want to do.
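For example, a rough sketch of that flow (raw_documents and hand_coded_ids are placeholder names for your own data):
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = []
unlabeled_tags = []
for doc_id, document in enumerate(raw_documents):
    tags = ['doc_%d' % doc_id]
    if doc_id in hand_coded_ids:  # the 700 hand-coded documents
        tags.append('TARGET')
    else:
        unlabeled_tags.append(tags[0])
    corpus.append(TaggedDocument(words=gensim.utils.simple_preprocess(document),
                                 tags=tags))

model = Doc2Vec(vector_size=100, min_count=2, epochs=10)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# rank the unlabeled documents by similarity to the shared 'TARGET' vector
ranking = sorted(((tag, model.docvecs.similarity(tag, 'TARGET'))
                  for tag in unlabeled_tags),
                 key=lambda pair: pair[1], reverse=True)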
60,000 documents is not very many for Doc2Vec, but maybe you will have good luck.
I have a corpus that contains several documents, for example 10 documents. The idea is to compute the similarity between them and combine the most similar ones into one document, so the result may be 4 documents. What I have done so far is that I iterate over the documents, find the two most similar documents, combine them into one document, and so on until a threshold. I used Word2vec vectors, taking the mean vector for the whole document. The problem is that as I proceed with the iterations, the longer a document gets, the more similar it scores, even if it is not really that similar, simply because it contains more words. Any idea on how to approach this problem?
I used Google's pretrained Word2vec model. The reason: the corpus is not big enough to train a model of my own.
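Roughly, the representation I use looks like this (a sketch; the model path is a placeholder):
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def doc_vector(tokens):
    # mean of the word vectors for the tokens found in the vocabulary
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))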
Note: I do not want to use topic modeling, for reasons specific to this project. Also, the documents are really short; more than half of them may be a single sentence.
I really appreciate your suggestions.
I have been given a doc2vec model built with gensim that was trained on 20 million documents. The 20 million documents it was trained on are also given to me, but I have no idea how, or in which order, the documents from the folder were trained. I am supposed to use the test data to find the top 10 matches from the training set. The code I use is -
model = gensim.models.doc2vec.Doc2Vec.load("doc2vec_sample.model")
test_docs=["This is the test set I want to test on."]
def read_corpus(documents, tokens_only=False):
    for count, line in enumerate(documents):
        if tokens_only:
            yield gensim.utils.simple_preprocess(line)
        else:
            # For training data, add tags
            yield gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [count])
test_corpus = list(read_corpus(test_docs, tokens_only=True))
doc_id=0
inferred_vector = model.infer_vector(test_corpus[doc_id])
maxx=10
sims = model.docvecs.most_similar([inferred_vector], topn=maxx)
for match in sims:
    print(match)
The output I get is -
(1913, 0.4589531719684601)
(3250, 0.4300411343574524)
(1741, 0.42669129371643066)
(1, 0.4023148715496063)
(1740, 0.3929900527000427)
(1509, 0.39229822158813477)
(3189, 0.387174129486084)
(3145, 0.3842133581638336)
(1707, 0.3813004493713379)
(3200, 0.3754497170448303)
How do I find out which document the document ID "1913" refers to? How can I access those documents of the training set from these 10 IDs?
The best approach is to ask the person who trained the model how they assigned IDs ('tags' in Doc2Vec parlance) to documents.
If that's not available, look at the training corpus to see if there's any natural naming or ordering that applies to the documents. (Are they one per file? Then perhaps the filenames in sorted order map to ascending IDs. Is each document a line in a single file? Then perhaps the line number is the ID-tag.)
When you have a theory, if the model was a usefully-trained model, then you can test it by seeing if the most_similar() results make sense with that ID-tag interpretation.
You could do this in an ad-hoc fashion – do the results of random probes with query documents look good to you?
Or you could try to formalize it, for example by re-inferring vectors for documents that were known to be in the training set, then looking for the most-similar documents to those vectors. If the model is good, and if the inference is working well (which might require tweaking the infer_vector() parameters), then either the "top hit" for a vector, or one of the top hits, should be the exact same document.
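A rough sketch of that formal check (train_texts is a hypothetical mapping from your assumed ID-tags to token lists):
hits = 0
probes = list(train_texts.items())[:100]
for assumed_tag, tokens in probes:
    vec = model.infer_vector(tokens)  # inference parameters may need tweaking
    top_tag, _ = model.docvecs.most_similar([vec], topn=1)[0]
    if top_tag == assumed_tag:
        hits += 1
print('%d of %d probes returned themselves as the top hit' % (hits, len(probes)))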
But really, if the model is so poorly documented that you can't correlate the documents to the IDs, and the original person isn't available, you may want to throw it out and re-train a model with better-documented procedures.
Simply read the documents into a list and query that 20-million-item list. Of course, you don't want to do print(documents) and get 20 million vectors on your screen; it may be more efficient to insert the documents list into a database table. When you build the documents list (i.e., train_corpus from the gensim doc2vec tutorial), the result is a list in the following format:
[TaggedDocument(words=['token1', 'token2', ..., 'tokenn'], tags=[document_number]), ...]
You can query this result to find the 1913th document in the list.
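For instance, assuming the model's tags correspond to positions in the training list, and with train_documents as a placeholder for however you iterate the 20 million training texts (at that scale a database table may be the better store, as noted):
train_corpus = list(read_corpus(train_documents))
docs_by_tag = {doc.tags[0]: doc.words for doc in train_corpus}

for tag, score in sims:
    print(tag, score, ' '.join(docs_by_tag[tag][:20]))  # first 20 tokens as a preview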
I'm using word2vec on a 1 million abstracts dataset (2 billion words). To find most similar documents, I use the gensim.similarities.WmdSimilarity class. When trying to retrieve the best match using wmd_similarity_index[query], the calculation spends most of its time building a dictionary. Here is a piece of log:
2017-08-25 09:45:39,441 : INFO : built Dictionary(127 unique tokens: ['empirical', 'model', 'estimating', 'vertical', 'concentration']...) from 2 documents (total 175 corpus positions)
2017-08-25 09:45:39,445 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
What does this part do? Is it dependent on the query? Is there a way to do these calculations once and for all?
EDIT: training and scoring phases in my code:
Training and saving to disk:
w2v_size = 300
word2vec = gensim.models.Word2Vec(texts, size=w2v_size, window=9, min_count=5, workers=1, sg=1, hs=1, iter=20) # sg=1 means skip gram is used
word2vec.save(utils.paths.PATH_DATA_GENSIM_WORD2VEC)
corpus_w2v_wmd_index = gensim.similarities.WmdSimilarity(texts, word2vec.wv)
corpus_w2v_wmd_index.save(utils.paths.PATH_DATA_GENSIM_CORPUS_WORD2VEC_WMD_INDEX)
Loading and scoring:
w2v = gensim.models.Word2Vec.load(utils.paths.PATH_DATA_GENSIM_WORD2VEC)
words = [t for t in proc_text if t in w2v.wv]
corpus_w2v_wmd_index = gensim.similarities.docsim.Similarity.load(utils.paths.PATH_DATA_GENSIM_CORPUS_WORD2VEC_WMD_INDEX)
scores_w2v = np.array(corpus_w2v_wmd_index[words])
The "Word Mover's Distance" calculation is relatively expensive – for each pairwise document comparison, it searches for an optimal 'shifting' of semantic positions, and that shifting is itself dependent on the pairwise simple-distances between all words of each compared document.
That is, it involves far more calculation than a simple cosine-distance between two high-dimensional vectors, and it involves more calculation the longer the two documents are.
There isn't much that could be pre-calculated, from the texts corpus, until the query's words are known. (Each pairwise calculation depends on the query's words, and their simple-distances to each corpus document's words.)
That said, there are some optimizations the gensim WmdSimilarity class doesn't yet do.
The original WMD paper described a quicker calculation that could help eliminate corpus texts that couldn't possibly be in the top-N most-WMD-similar results. Theoretically, the gensim WmdSimilarity could also implement this optimization, and give quicker results, at least when initializing the WmdSimilarity with the num_best parameter. (Without it, every query returns all WMD-similarity-scores, so this optimization wouldn't help.)
Also, for now the WmdSimilarity class just calls KeyedVectors.wmdistance(doc1, doc2) for every query-to-corpus-document pair, as raw texts. Thus the pairwise simple-distances from all doc1 words to doc2 words will be recalculated each time, even if many pairs repeat across the corpus. (That is, if 'apple' is in the query and 'orange' is in every corpus doc, it will still calculate the 'apple'-to-'orange' distance repeatedly.)
So, some caching of interim values might help performance. For example, with a query of 1000 words, and a vocabulary of 100,000 words among all corpus documents, the ((1000 * 100,000) / 2) 50 million pairwise word-distances could be precalculated once, using 200MB, then shared by all subsequent WMD-calculations. To add this optimization would require a cooperative refactoring of WmdSimilarity.get_similarities() and KeyedVectors.wmdistance().
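A rough illustration of that caching idea only (not the actual gensim refactoring; query_tokens and corpus_vocabulary are hypothetical names):
import numpy as np
from scipy.spatial.distance import cdist

query_words = [w for w in query_tokens if w in w2v.wv]
vocab_words = [w for w in corpus_vocabulary if w in w2v.wv]
query_matrix = np.vstack([w2v.wv[w] for w in query_words])
vocab_matrix = np.vstack([w2v.wv[w] for w in vocab_words])

# one matrix of query-word-to-vocab-word Euclidean distances, computed once
# per query and then shared by every per-document WMD calculation
pairwise = cdist(query_matrix, vocab_matrix)
vocab_index = {w: j for j, w in enumerate(vocab_words)}

# e.g. the 'apple'-to-'orange' distance is looked up rather than recomputed
d = pairwise[query_words.index('apple'), vocab_index['orange']]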
Finally, Word2Vec/Doc2Vec applications don't necessarily require or benefit much from stop-word removal or stemming. But because the expense of WMD calculation grows with document and vocabulary size, anything that shrinks effective document sizes could help performance. So various ways of discarding low-value words, or coalescing similar words, may be worth considering when using WMD on large document sets.
I want to make sure I understand what the attributes use_idf and sublinear_tf do in the TfidfVectorizer object. I've been researching this for a few days. I am trying to classify documents of varied length and currently use tf-idf for feature selection.
I believe that when use_idf=true, the algorithm normalises against the inherent issue with raw TF, where a term that is X times more frequent shouldn't be X times as important, utilising the tf*idf formula. Then sublinear_tf=true applies 1+log(tf), so that it normalises the bias against lengthy documents vs. short documents.
I am dealing with an inherent bias towards lengthy documents (most belong to one class); does this normalisation really diminish the bias?
How can I make sure the length of the documents in the corpus are not integrated into the model?
I'm trying to verify that the normalisation is being applied in the model. I am trying to extract the normalised vectors of the corpora, so I assumed I could just sum up each row of the TfidfVectorizer matrix. However, the sums are greater than 1; I thought a normalised corpus would transform all documents to a range between 0 and 1.
vect = TfidfVectorizer(max_features=20000, strip_accents='unicode',
                       stop_words=stopwords, analyzer='word', use_idf=True, tokenizer=tokenizer,
                       ngram_range=(1,2), sublinear_tf=True, norm='l2')
tfidf = vect.fit_transform(X_train)
# sum norm l2 documents
vect_sum = tfidf.sum(axis=1)
Neither use_idf nor sublinear_tf deals with document length. And actually your explanation for use_idf ("a term that is X times more frequent shouldn't be X times as important") fits better as a description of sublinear_tf, since sublinear_tf causes a logarithmic rather than linear increase in the Tfidf score relative to the term frequency.
use_idf means using Inverse Document Frequency, so that terms that appear so frequently they show up in most documents (i.e., a bad indicator) get weighted less than terms that appear less frequently, but only in specific documents (i.e., a good indicator).
To reduce document length bias, you use normalization (the norm parameter of TfidfVectorizer): each term's Tfidf score is scaled in proportion to that document's total weight (the sum of absolute values for norm='l1', the Euclidean length for norm='l2').
By default, TfidfVectorizer already uses norm='l2', though, so I'm not sure what is causing the problem you are facing. Perhaps those longer documents do indeed contain similar words? Also, classification often depends a lot on the data, so I can't say much more here to solve your problem.
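One note on the row sums mentioned in the question: with norm='l2' it is the squared entries of each row that sum to 1, so plain row sums above 1 are expected. A quick check on the tfidf matrix from the question:
import numpy as np

# each non-empty row should have Euclidean (l2) length ~1.0, even though the
# plain sum of its entries can exceed 1
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))
print(row_norms[:5])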
References:
TfidfVectorizer documentation
Wikipedia
Stanford Book
use_idf=true (the default) introduces a global component to the term frequency component (the local component: the individual article). When looking at the similarity of two texts, instead of just counting the number of terms that each of them has and comparing those counts, introducing the idf helps to categorize these terms as relevant or not. According to Zipf's law, the "frequency of any word is inversely proportional to its rank". That is, the most common word appears about twice as often as the second most common word, three times as often as the third most common word, etc. Even after removing stop words, all words are subject to Zipf's law.
In this sense, imagine you have 5 articles describing the topic of automobiles. In this example the word "auto" will likely appear in all 5 texts, and therefore will not be a unique identifier of a single text. On the other hand, if one article describes auto "insurance" while another describes auto "mechanics", these two words ("mechanics" and "insurance") will be unique identifiers of their texts. By using the idf, words that appear less commonly across the texts ("mechanics" and "insurance", for example) receive a higher weight. Therefore, using the idf does not tackle the bias generated by the length of an article, since it is, again, a global measure. If you want to reduce the bias generated by length then, as you said, using sublinear_tf=True is one way to address it, since you are transforming the local component (each article).
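A tiny illustration of that sublinear_tf effect, using a made-up two-document corpus (normalization is disabled so only the tf component changes):
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["auto auto auto auto insurance", "auto mechanics"]
for sublinear in (False, True):
    vect = TfidfVectorizer(sublinear_tf=sublinear, norm=None)
    X = vect.fit_transform(corpus).toarray()
    order = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
    # with sublinear_tf=True the repeated 'auto' is weighted as 1 + log(4)
    # instead of 4, damping the effect of long, repetitive documents
    print(sublinear, dict(zip(order, X[0].round(3))))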
Hope it helps.
I am writing an ML module (Python) to predict tags for a Stack Overflow question (title + body). My corpus is of around 5 million questions with title, body and tags for each. I'm splitting this 3:2 for training and testing. I'm plagued by the curse of dimensionality.
Work Done
Pre-processing: markup removal, stopword removal, special character removal and a few bits and pieces. Store into MySQL. This almost halves the size of the test data.
ngram association: for each unigram and bigram in the title and the body of each question, I maintain a list of the associated tags. Store into redis. This results in about a million unique unigrams and 20 million unique bigrams, each with a corresponding list of tag frequencies. Ex.
"continuous integration": {"ci":42, "jenkins":15, "windows":1, "django":1, ....}
Note: There are 2 problems here: a) not all unigrams and bigrams are important, and b) not all tags associated with an ngram are important, although this doesn't mean that tags with frequency 1 are all equivalent or can be haphazardly removed. The number of tags associated with a given ngram easily runs into the thousands - most of them unrelated and irrelevant.
tfidf: to aid in selecting which ngrams to keep, I calculated the tfidf score for the entire corpus for each unigram and bigram and stored the corresponding idf values with associated tags. Ex.
"continuous integration": {"ci":42, "jenkins":15, ...., "__idf__":7.2123}
The tfidf scores are stored in a document-by-feature sparse.csr_matrix, and I'm not sure how I can leverage that at the moment. (It is generated by fit_transform().)
Questions
How can I use this processed data to reduce the size of my feature set? I've read about SVD and PCA, but the examples always talk about a set of documents and a vocabulary; I'm not sure where the tags from my set can come in. Also, because of the way my data is stored (redis + sparse matrix), it is difficult to use an already-implemented module (sklearn, nltk, etc.) for this task.
Once the feature set is reduced, the way I have planned to use it is as follows:
Preprocess the test data.
Find the unigrams and bigrams.
For the ones stored in redis, find the corresponding best-k tags
Apply some kind of weight for the title and body text
Apart from this I might also search for exact known tag matches in the document. E.g., if "ruby-on-rails" occurs in the title/body, then there's a high probability that it's also a relevant tag.
Also, for tags predicted with a high probability, I might leverage a tag graph (an undirected graph where tags frequently occurring together have weighted edges between them) to predict more tags.
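For instance, the tag graph could be something as simple as weighted co-occurrence counts (a sketch; tag_lists is a placeholder for the per-question tag lists):
from collections import Counter
from itertools import combinations

edge_weights = Counter()
for tags in tag_lists:
    for a, b in combinations(sorted(set(tags)), 2):
        edge_weights[(a, b)] += 1

def neighbours(tag, topn=5):
    # strongest co-occurring tags for a tag already predicted with high probability
    scored = [(b if a == tag else a, w)
              for (a, b), w in edge_weights.items() if tag in (a, b)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]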
Are there any suggestions on how to improve upon this? Can a classifier come in handy?
Footnote
I have a 16-core machine with 16GB of RAM. The redis database (the redis-server will move to a different machine) is held in RAM and is ~10GB. All the tasks mentioned above (apart from tfidf) are done in parallel using ipython clusters.
Use the public API of Dandelion; this is a demo.
It extracts concepts from a text, so, in order to reduce dimensionality, you could use those concepts instead of the bag-of-words paradigm.
A baseline statistical approach would treat this as a classification problem. Features are bags-of-words processed by a maximum entropy classifier like Mallet http://mallet.cs.umass.edu/classification.php. Maxent (aka logistic regression) is good at handling large feature spaces. Take the probability associated with each tag (i.e., the class labels) and choose some decision threshold that gives you a precision/recall tradeoff that works for your project. Some of the Mallet documentation even mentions topic classification, which is very similar to what you are trying to do.
The open questions are how well Mallet handles the size of your data (which isn't that big) and whether this particular tool is a non-starter with the technology stack you mentioned. You might be able to train offline (dump the redis database to a text file in Mallet's feature format) and run the Mallet-learned model in Python. Evaluating a maxent model is simple. If you want to stay in Python and have this be more automated, there are Python-based maxent implementations in NLTK and probably in scikit-learn. This approach is not at all state-of-the-art, but it'll work okay and be a decent baseline with which to compare more complicated methods.
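If you do stay in Python, a rough sketch of that kind of baseline with scikit-learn (titles_and_bodies, tag_lists and new_questions are placeholders for your own data):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

vect = TfidfVectorizer(ngram_range=(1, 2), min_df=5, sublinear_tf=True)
X = vect.fit_transform(titles_and_bodies)  # preprocessed question texts
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tag_lists)           # one list of tags per question

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000), n_jobs=-1)
clf.fit(X, Y)

# per-tag probabilities, thresholded for the precision/recall tradeoff you want
probs = clf.predict_proba(vect.transform(new_questions))
threshold = 0.3
predicted_tags = [list(mlb.classes_[row > threshold]) for row in probs]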