I'm using word2vec on a 1 million abstracts dataset (2 billion words). To find most similar documents, I use the gensim.similarities.WmdSimilarity class. When trying to retrieve the best match using wmd_similarity_index[query], the calculation spends most of its time building a dictionary. Here is a piece of log:
2017-08-25 09:45:39,441 : INFO : built Dictionary(127 unique tokens: ['empirical', 'model', 'estimating', 'vertical', 'concentration']...) from 2 documents (total 175 corpus positions)
2017-08-25 09:45:39,445 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
What does this part do? Is it dependent on the query? Is there a way to do these calculations once and for all?
EDIT: training and scoring phases in my code:
Training and saving to disk:
w2v_size = 300
word2vec = gensim.models.Word2Vec(texts, size=w2v_size, window=9, min_count=5, workers=1, sg=1, hs=1, iter=20) # sg=1 means skip gram is used
word2vec.save(utils.paths.PATH_DATA_GENSIM_WORD2VEC)
corpus_w2v_wmd_index = gensim.similarities.WmdSimilarity(texts, word2vec.wv)
corpus_w2v_wmd_index.save(utils.paths.PATH_DATA_GENSIM_CORPUS_WORD2VEC_WMD_INDEX)
Loading and scoring:
w2v = gensim.models.Word2Vec.load(utils.paths.PATH_DATA_GENSIM_WORD2VEC)
words = [t for t in proc_text if t in w2v.wv]
corpus_w2v_wmd_index = gensim.similarities.docsim.Similarity.load(utils.paths.PATH_DATA_GENSIM_CORPUS_WORD2VEC_WMD_INDEX)
scores_w2v = np.array(corpus_w2v_wmd_index[words])
The "Word Mover's Distance" calculation is relatively expensive – for each pairwise document comparison, it searches for an optimal 'shifting' of semantic positions, and that shifting is itself dependent on the pairwise simple-distances between all words of each compared document.
That is, it involves far more calculation than a simple cosine-distance between two high-dimensional vectors, and it involves more calculation the longer the two documents are.
There isn't much that could be pre-calculated, from the texts corpus, until the query's words are known. (Each pairwise calculation depends on the query's words, and their simple-distances to each corpus document's words.)
That said, there are some optimizations the gensim WmdSimilarity class doesn't yet do.
The original WMD paper described a quicker calculation that could help eliminate corpus texts that couldn't possibly be in the top-N most-WMD-similar results. Theoretically, the gensim WmdSimilarity could also implement this optimization, and give quicker results, at least when initializing the WmdSimilarity with the num_best parameter. (Without it, every query returns all WMD-similarity-scores, so this optimization wouldn't help.)
Also, for now the WmdSimilarity class just calls KeyedVectors.wmdistance(doc1, doc2) for every query-to-corpus-document pair, as raw texts. Thus the pairwise simple-distances from all doc1 words to doc2 words will be recalculated each time, even if many pairs repeat across the corpus. (That is, if 'apple' is in the query and 'orange' is in every corpus doc, it will still calculate the 'apple'-to-'orange' distance repeatedly.)
So, some caching of interim values might help performance. For example, with a query of 1000 words, and a vocabulary of 100,000 words among all corpus documents, the ((1000 * 100,000) / 2) 50 million pairwise word-distances could be precalculated once, using 200MB, then shared by all subsequent WMD-calculations. To add this optimization would require a cooperative refactoring of WmdSimilarity.get_similarities() and KeyedVectors.wmdistance().
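To make that concrete, here's a rough sketch of such a cache; the helper and variable names are made up for illustration, not part of gensim:
import numpy as np

def precompute_word_distances(query_words, corpus_vocab, keyed_vectors):
    # Hypothetical helper: compute each query-word-to-corpus-word Euclidean
    # distance once, so repeated per-document WMD calls could look distances
    # up here instead of recalculating them every time.
    cache = {}
    for q in query_words:
        q_vec = keyed_vectors[q]
        for c in corpus_vocab:
            cache[(q, c)] = float(np.linalg.norm(q_vec - keyed_vectors[c]))
    return cache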
Finally, Word2Vec/Doc2Vec applications don't necessarily require or benefit much from stop-word removal or stemming. But because the expense of WMD calculation grows with document and vocabulary size, anything that shrinks effective document sizes could help performance. So various ways of discarding low-value words, or coalescing similar words, may be worth considering when using WMD on large document sets.
Related
I'd like to compare how the same word, for example "travel", differs when it is mentioned in different sentences.
What I would like to do is:
Take the sentences mentioning the term "travel" as plain text;
In each sentence, replace 'travel' with travel_sent_x.
Train a word2vec model on these sentences.
Calculate the distance between travel_sent1, travel_sent2, and other relabelled mentions of "travel"
So each sentence's "travel" gets its own vector, which is used for comparison.
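For what it's worth, here is a rough sketch of the relabelling step I have in mind (toy sentences, deliberately simple whitespace tokenisation):
sentences = ["I love to travel by train", "Business travel is exhausting"]  # toy examples

relabelled = []
for i, sent in enumerate(sentences, start=1):
    tokens = sent.lower().split()
    tokens = ["travel_sent{}".format(i) if t == "travel" else t for t in tokens]
    relabelled.append(tokens)
# relabelled can then be passed to Word2Vec as its `sentences` argument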
I know that word2vec requires much more than several sentences to train reliable vectors. The official page recommends datasets of billions of words, but I do not have anywhere near that number in my dataset (I have thousands of words).
I was trying to test the model with the following few sentences:
Sentences
Hawaii makes a move to boost domestic travel and support local tourism
Honolulu makes a move to boost travel and support local tourism
Hawaii wants tourists to return so much it's offering to pay for half of their travel expenses
My approach to building the vectors has been:
from gensim.models import Word2Vec

vocab = [sentence.lower().split() for sentence in df['Sentences']]  # tokenize each sentence
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=3, workers=4, sg=0)
# look up the learned vector for each in-vocabulary word of each sentence
df['Sentences'].apply(lambda s: [model.wv[w] for w in s.lower().split() if w in model.wv])
However, I do not know how to visualise the results to see their similarity and get some useful insight.
Any help and advice will be welcome.
Update: I would use the Principal Component Analysis algorithm to visualise the embeddings in 3-dimensional space. I know how to do it for each individual word, but I do not know how to do it in the case of sentences.
Note that word2vec is not inherently a method for modeling sentences, only words. So there's no single, official way to use word2vec to represent sentences.
One quick & crude approach is to create a vector for a sentence (or other multi-word text) by averaging all the word-vectors together. It's fast, it's better-than-nothing, and does OK on some simple (broadly-topical) tasks - but isn't going to capture the full meaning of a text very well, especially any meaning which is dependent on grammar, polysemy, or sophisticated contextual hints.
Still, you could use it to get a fixed-size vector per short text, and calculate pairwise similarities/distances between those vectors, and feed the results into dimensionality-reduction algorithms for visualization or other purposes.
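For example, a minimal sketch of that averaging approach (the helper names here are just illustrative, not a gensim API):
import numpy as np

def average_vector(tokens, keyed_vectors):
    # average the vectors of the tokens the model knows; None if none are known
    known = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    return np.mean(known, axis=0) if known else None

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))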
Other algorithms actually create vectors for longer texts. A shallow algorithm very closely related to word2vec is 'paragraph vectors', available in Gensim as the Doc2Vec class. But it's still not very sophisticated, and still not grammar-aware. A number of deeper-network text models like BERT, ELMo, & others may be possibilities.
Word2vec & related algorithms are very data-hungry: all of their beneficial qualities arise from the tug-of-war between many varied usage examples for the same word. So if you have a toy-sized dataset, you won't get a set of vectors with useful interrelationships.
But also, rare words in your larger dataset won't get good vectors. It is typical in training to discard, as if they weren't even there, words that appear below some min_count frequency - because not only would their vectors be poor, from just one or a few idiosyncratic sample uses, but because there are many such underrepresented words in total, keeping them around tends to make other word-vectors worse, too. They're noise.
So, your proposed idea of taking individual instances of travel & replacing them with single-appearance tokens is not very likely to give interesting results. Lowering your min_count to 1 will get you vectors for each variant - but they'll be of far worse (& more-random) quality than your other word-vectors, having received comparatively little training attention, and each being fully influenced by just their few surrounding words (rather than the entire range of all surrounding contexts that could all help contribute to the useful positioning of a unified travel token).
(You might be able to offset these problems, a little, by (1) retaining the original version of the sentence, so you still get a travel vector; (2) repeating your token-mangled sentences several times, & shuffling them to appear throughout the corpus, to somewhat simulate more real occurrences of your synthetic contexts. But without real variety, most of the problems of such single-context vectors will remain.)
Another possible way to compare travel_sent_A, travel_sent_B, etc would be to ignore the exact vector for travel or travel_sent_X entirely, but instead compile a summary vector for the word's surrounding N words. For example if you have 100 examples of the word travel, create 100 vectors that are each of the N words around travel. These vectors might show some vague clusters/neighborhoods, especially in the case of a word with very-different alternate meanings. (Some research adapting word2vec to account for polysemy uses this sort of context vector approach to influence/choose among alternate word-senses.)
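A rough sketch of that idea, again with made-up helper names rather than any gensim API:
import numpy as np

def context_vectors(tokenized_sentences, target, keyed_vectors, window=5):
    # one averaged vector of the words surrounding each occurrence of `target`
    vectors = []
    for tokens in tokenized_sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            known = [keyed_vectors[w] for w in neighbors if w in keyed_vectors]
            if known:
                vectors.append(np.mean(known, axis=0))
    return vectors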
You might also find this research on modeling words as drawing from alternate 'atoms' of discourse interesting: Linear algebraic structure of word meanings
To the extent you have short headline-like texts, and only word-vectors (without the data or algorithms to do deeper modeling), you may also want to look into the "Word Mover's Distance" calculation for comparing texts. Rather than reducing a single text to a single vector, it models it as a "bag of word-vectors". Then, it defines a distance as a cost-to-transform one bag to another bag. (More similar words are easier to transform into each other than less-similar words, so expressions that are very similar, with just a few synonyms replaced, report as quite close.)
It can be quite expensive to calculate on longer texts, but may work well for short phrases and small sets of headlines/tweets/etc. It's available on the Gensim KeyedVector classes as wmdistance(). An example of the kinds of correlations it may be useful in discovering is in this article: Navigating themes in restaurant reviews with Word Mover’s Distance
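Usage is straightforward; a minimal sketch, assuming a trained Word2Vec model named w2v_model and two pre-tokenized texts (the example tokens are made up):
distance = w2v_model.wv.wmdistance(
    ['hawaii', 'boosts', 'domestic', 'travel'],
    ['honolulu', 'supports', 'local', 'tourism'])
# smaller distances mean more-similar texts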
If you are interested in comparing sentences, Word2Vec is not the best choice. It has been shown that using it to create sentence embeddings produces inferior results compared to a dedicated sentence-embedding algorithm. If your dataset is not huge, you can't train a new embedding space from your own data. This forces you to use a pre-trained embedding for the sentences. Luckily, there are enough of those nowadays. I believe that the Universal Sentence Encoder (by Google) will suit your needs best.
Once you get a vector representation for your sentences, you can go two ways:
create a matrix of pairwise comparisons and visualize it as a heatmap. This representation is useful when you have some prior knowledge about how close the sentences are and you want to check your hypothesis. You can even try it online.
run t-SNE on the vector representations. This will create a 2D projection of the sentences that preserves the relative distances between them. It presents the data much better than PCA. Then you can easily find the neighbors of a given sentence.
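A rough sketch of that projection, assuming sentence_vectors is already an (n_sentences x dim) array of your sentence embeddings (the perplexity value here is just a small placeholder):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(np.asarray(sentence_vectors))

plt.scatter(coords[:, 0], coords[:, 1])
for i, (x, y) in enumerate(coords):
    plt.annotate(str(i), (x, y))  # label each point with its sentence index
plt.show()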
You can learn more from this and this
Interesting take on the word2vec model. You can use t-SNE embeddings of the vectors to reduce the dimensionality to 3 and visualise them using any plotting library such as matplotlib or dash. I also find this tool helpful when visualising word embeddings: https://projector.tensorflow.org/
The idea of learning different word embeddings for words in different contexts is the premise of ELMo (https://allennlp.org/elmo), but you will need a huge training set to train it. Luckily, if your application is not very specific, you can use pre-trained models.
In gensim I have a trained doc2vec model. If I have a document and either a single word or two or three words, what would be the best way to calculate the similarity of the words to the document?
Do I just do the standard cosine similarity between them as if they were 2 documents? Or is there a better approach for comparing small strings to documents?
My first thought was to take the cosine similarity between each word in the 1-3 word string and every word in the document, averaging the results, but I don't know how effective this would be.
There's a number of possible approaches, and what's best will likely depend on the kind/quality of your training data and ultimate goals.
With any Doc2Vec model, you can infer a vector for a new text that contains known words – even a single-word text – via the infer_vector() method. However, like Doc2Vec in general, this tends to work better with documents of at least dozens, and preferably hundreds, of words. (Tiny 1-3 word documents seem especially likely to get somewhat peculiar/extreme inferred-vectors, especially if the model/training-data was underpowered to begin with.)
Beware that unknown words are ignored by infer_vector(), so if you feed it a 3-word document for which two words are unknown, it's really just inferring based on the one known word. And if you feed it only unknown words, it will return a random, mild initialization vector that's undergone no inference tuning. (All inference/training always starts with such a random vector, and if there are no known words, you just get that back.)
Still, this may be worth trying, and you can directly compare via cosine-similarity the inferred vectors from tiny and giant documents alike.
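For example, a small sketch of that comparison (the document tag here is hypothetical):
import numpy as np

short_vec = model.infer_vector(['solar', 'power'])  # a tiny 2-word text
doc_vec = model.docvecs['doc_12345']                # hypothetical tag of a long document
similarity = float(np.dot(short_vec, doc_vec) / (np.linalg.norm(short_vec) * np.linalg.norm(doc_vec)))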
Many Doc2Vec modes train both doc-vectors and compatible word-vectors. The default PV-DM mode (dm=1) does this, or PV-DBOW (dm=0) if you add the optional interleaved word-vector training (dbow_words=1). (If you use dm=0, dbow_words=0, you'll get fast training, and often quite-good doc-vectors, but the word-vectors won't have been trained at all - so you wouldn't want to look up such a model's word-vectors directly for any purposes.)
With such a Doc2Vec model that includes valid word-vectors, you could also analyze your short 1-3 word docs via their individual words' vectors. You might check each word individually against a full document's vector, or use the average of the short document's words against a full document's vector.
Again, which is best will likely depend on other particulars of your need. For example, if the short doc is a query, and you're listing multiple results, it may be the case that query result variety – via showing some hits that are really close to single words in the query, even when not close to the full query – is as valuable to users as documents close to the full query.
Another measure worth looking at is "Word Mover's Distance", which works just with the word-vectors for a text's words, as if they were "piles of meaning" for longer texts. It's a bit like the word-against-every-word approach you entertained – but working to match words with their nearest analogues in a comparison text. It can be quite expensive to calculate (especially on longer texts) – but can sometimes give impressive results in correlating alternate texts that use varied words to similar effect.
I have a list of ~10 million sentences, where each of them contains up to 70 words.
I'm running gensim word2vec on every word, and then taking the simple average of each sentence. The problem is that I use min_count=1000, so a lot of words are not in the vocab.
To solve that, I intersect the vocab array (which contains about 10000 words) with every sentence, and if there's at least one element left in that intersection, it returns the simple average; otherwise, it returns a vector of zeros.
The issue is that calculating every average takes a very long time when I run it on the whole dataset, even when splitting into multiple threads, and I would like to get a better solution that could run faster.
I'm running this on an EC2 r4.4xlarge instance.
I already tried switching to doc2vec, which was way faster, but the results were not as good as word2vec's simple average.
import multiprocessing
import numpy as np
from gensim.models import Word2Vec

word2vec_aug_32x = Word2Vec(sentences=sentences,
                            min_count=1000,
                            size=32,
                            window=2,
                            workers=16,
                            sg=0)

vocab_arr = np.array(list(word2vec_aug_32x.wv.vocab.keys()))

def get_embedded_average(sentence):
    # keep only the words the model knows, then average their vectors
    sentence = np.intersect1d(sentence, vocab_arr)
    if sentence.shape[0] > 0:
        return np.mean(word2vec_aug_32x[sentence], axis=0).tolist()
    else:
        return np.zeros(32).tolist()

pool = multiprocessing.Pool(processes=16)
w2v_averages = np.asarray(pool.map(get_embedded_average, np.asarray(sentences)))
pool.close()
If you have any suggestions of different algorithms or techniques that have the same purpose of sentence embedding and could solve my problem, I would love to read about it.
You could use FastText instead of Word2Vec. FastText is able to embed out-of-vocabulary words by looking at subword information (character ngrams). Gensim also has a FastText implementation, which is very easy to use:
from gensim.models import FastText
model = FastText(sentences=training_data, size=128, ...)
word = 'hello' # can be out of vocabulary
embedding = model[word] # fetches the word embedding
Usually Doc2Vec text-vector usefulness is quite-similar (or when tuned, a little better) compared to a plain average-of-word-vectors. (After all, the algorithms are very similar, working on the same form of the same data, and the models created are about the same size.) If there was a big drop-off, there may have been errors in your Doc2Vec process.
As @AnnaKrogager notes, FastText can handle out-of-vocabulary words by synthesizing guesswork vectors, using word-fragments. (This requires languages where words have such shared roots.) The vectors may not be great but are often better than either ignoring unknown words entirely, or using all-zero vectors or random plug vectors.
Is splitting it among processes helping the runtime at all? Because there's a lot of overhead in sending batches-of-work to-and-from subprocesses, and subprocesses in Python can cause a ballooning of memory needs – and both that overhead and possibly even virtual-memory swapping could outweigh any other benefits of parallelism.
I have a sample of ~60,000 documents. We've hand coded 700 of them as having a certain type of content. Now we'd like to find the "most similar" documents to the 700 we already hand-coded. We're using gensim doc2vec and I can't quite figure out the best way to do this.
Here's what my code looks like:
import multiprocessing
import random
from gensim.models.doc2vec import Doc2Vec

cores = multiprocessing.cpu_count()
model = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2, sample=0,
                epochs=10, workers=cores, dbow_words=1, train_lbls=False)

all_docs = load_all_files()  # this function returns a named tuple
random.shuffle(all_docs)
print("Docs loaded!")

model.build_vocab(all_docs)
model.train(all_docs, total_examples=model.corpus_count, epochs=5)
I can't figure out the right way to go forward. Is this something that doc2vec can do? In the end, I'd like to have a ranked list of the 60,000 documents, where the first one is the "most similar" document.
Thanks for any help you might have! I've spent a lot of time reading the gensim help documents and the various tutorials floating around and haven't been able to figure it out.
EDIT: I can use this code to get the documents most similar to a short sentence:
token = "words associated with my research questions".split()
new_vector = model.infer_vector(token)
sims = model.docvecs.most_similar([new_vector])
for x in sims:
    print(' '.join(all_docs[x[0]][0]))
If there's a way to modify this to instead get the documents most similar to the 700 coded documents, I'd love to learn how to do it!
Your general approach is reasonable. A few notes about your setup:
you'd have to specify epochs=10 in your train() call to truly get 10 training passes – and 10 or more is most common in published work
sample-controlled downsampling helps speed training and often improves vector quality as well, and the value can become more aggressive (smaller) with larger datasets
train_lbls is not a parameter to Doc2Vec in any recent gensim version
There are several possible ways to interpret and pursue your goal of "find the 'most similar' documents to the 700 we already hand-coded". For example, for a candidate document, how should its similarity to the set-of-700 be defined - as a similarity to one summary 'centroid' vector for the full set? Or as its similarity to any one of the documents?
There are a couple ways you could obtain a single summary vector for the set:
average their 700 vectors together (a sketch of this option appears after this list)
combine all their words into one synthetic composite document, and infer_vector() on that document. (But note: texts fed to gensim's optimized word2vec/doc2vec routines face an internal implementation limit of 10,000 tokens – excess words are silently ignored.)
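A sketch of the first option, assuming ref_docs holds the 700 tags used during training:
import numpy as np

# average the trained doc-vectors for the 700 hand-coded tags into one centroid
centroid = np.mean([model.docvecs[tag] for tag in ref_docs], axis=0)
# centroid could then be passed as a positive example to most_similar(), as discussed below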
In fact, the most_similar() method can take a list of multiple vectors as its 'positive' target, and will automatically average them together before returning its results. So if, say, the 700 document IDs (tags used during training) are in the list ref_docs, you could try...
sims = model.docvecs.most_similar(positive=ref_docs, topn=0)
...and get back, for every document in the model, its similarity to the average of all those positive examples. (With topn=0, this comes back as a raw array of scores in document order rather than a sorted top-N list, so you'd rank it yourself.)
However, the alternate interpretation, that a document's similarity to the reference-set is its highest similarity to any one document inside the set, might be better for your purpose. This could especially be the case if the reference set itself is varied over many themes – and thus not well-summarized by a single average vector.
You'd have to compute these similarities with your own loops. For example, roughly:
sim_to_ref_set = {}
for doc_id in all_doc_ids:
    sim_to_ref_set[doc_id] = max([model.docvecs.similarity(doc_id, ref_id) for ref_id in ref_docs])
sims_ranked = sorted(sim_to_ref_set.items(), key=lambda it: it[1], reverse=True)
The top items in sims_ranked would then be those most-similar to any item in the reference set. (Assuming the reference-set ids are also in all_doc_ids, the 1st 700 results will be the chosen docs again, all with a self-similarity of 1.0.)
n_similarity looks like the function you want, but it seems to only work with samples in the training set.
Since you have only 700 documents to crosscheck with, using sklearn shouldn't pose performance issues. Simply get the vectors of your 700 documents and use sklearn.metrics.pairwise.cosine_similarity to find the closest match. Then you can find the ones with the highest similarity (e.g. using np.argmax). Some untested code to illustrate that:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

reference_vectors = ... # your vectors for the 700 documents
new_vector = ...        # inferred as per your last example
similarity_matrix = cosine_similarity([new_vector], reference_vectors)
most_similar_indices = np.argmax(similarity_matrix, axis=-1)
That can also be modified to implement a method like n_similarity for a number of unseen documents.
I think you can do what you want with TaggedDocument. The basic use case is to just add a unique tag (document id) for every document, but here you will want to add a special tag to all 700 of your hand-selected documents. Call it whatever you want, in this case I call it TARGET. Add that tag to only your 700 hand-tagged documents, omit it for the other 59,300.
TaggedDocument(words=gensim.utils.simple_preprocess(document),tags=['TARGET',document_id])
Now, train your Doc2Vec.
Next, you can use model.docvecs.similarity to score the similarity between your unlabeled documents and the custom tag.
model.docvecs.similarity(document_id,'TARGET')
And then just sort that. n_similarity and most_similar I don't think are going to be appropriate for what you want to do.
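Putting those steps together, a rough sketch (here documents is assumed to be a list of (document_id, raw_text) pairs, and target_ids the set of 700 hand-coded ids; both names are hypothetical):
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = []
for document_id, document in documents:
    tags = ['TARGET', document_id] if document_id in target_ids else [document_id]
    corpus.append(TaggedDocument(words=gensim.utils.simple_preprocess(document), tags=tags))

model = Doc2Vec(vector_size=100, min_count=2, epochs=10)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# score every non-target document against the shared TARGET tag, then rank
scores = {doc_id: model.docvecs.similarity(doc_id, 'TARGET')
          for doc_id, _ in documents if doc_id not in target_ids}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)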
60,000 documents is not very many for Doc2Vec, but maybe you will have good luck.
I want to make sure I understand what the attributes use_idf and sublinear_tf do in the TfidfVectorizer object. I've been researching this for a few days. I am trying to classify documents of varied length and currently use tf-idf for feature selection.
I believe that when use_idf=true, the algorithm normalises against the inherent issue with raw term frequency, where a term that is X times more frequent shouldn't be X times as important, by utilising the tf*idf formula. Then sublinear_tf=true applies 1+log(tf), which I understood as normalising the bias against lengthy documents versus short documents.
I am dealing with an inherent bias towards lengthy documents (most belong to one class); does this normalisation really diminish the bias?
How can I make sure the length of the documents in the corpus are not integrated into the model?
I'm trying to verify that the normalisation is being applied in the model. I am trying to extract the normalized vectors of the corpus, so I assumed I could just sum up each row of the TfidfVectorizer matrix. However the sums are greater than 1, and I thought a normalized corpus would transform all documents to a range between 0 and 1.
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(max_features=20000, strip_accents='unicode',
                       stop_words=stopwords, analyzer='word', use_idf=True,
                       tokenizer=tokenizer, ngram_range=(1, 2),
                       sublinear_tf=True, norm='l2')
tfidf = vect.fit_transform(X_train)

# sum each l2-normalised document row
vect_sum = tfidf.sum(axis=1)
Neither use_idf nor sublinear_tf deals with document length. And actually your explanation for use_idf, "a term that is X times more frequent shouldn't be X times as important", is a better description of sublinear_tf, since sublinear_tf makes the Tf-idf score grow logarithmically with the term frequency rather than linearly.
use_idf means using Inverse Document Frequency, so that terms that appear very frequently, to the extent that they appear in most documents (i.e., a bad indicator), get weighted less than terms that appear less frequently but only in specific documents (i.e., a good indicator).
To reduce document-length bias, you use normalization (the norm parameter in TfidfVectorizer): each term's Tf-idf score is scaled by the document's total score (the sum of absolute values for norm='l1', the Euclidean length for norm='l2').
By default, TfidfVectorizer already uses norm='l2', though, so I'm not sure what is causing the problem you are facing. Perhaps those longer documents indeed contain similar words? Also, classification often depends a lot on the data, so I can't say much here to solve your problem.
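One quick, rough way to see this on your own matrix (assuming tfidf is the sparse output of fit_transform from your snippet): the plain row sums can exceed 1, but with norm='l2' the Euclidean length of every row is 1.
import numpy as np

row_l2_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))  # Euclidean length per document row
print(row_l2_norms[:5])       # each ~1.0 under norm='l2'
print(tfidf.sum(axis=1)[:5])  # plain sums of non-negative entries, typically > 1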
References:
TfidfVectorizer documentation
Wikipedia
Stanford Book
use_idf=true (the default) introduces a global component to the term-frequency component (whose local component is the individual article). When looking at the similarity of two texts, instead of counting the number of terms that each of them has and comparing them, introducing the idf helps categorize those terms as relevant or not. According to Zipf's law, the "frequency of any word is inversely proportional to its rank". That is, the most common word will appear about twice as often as the second most common word, three times as often as the third most common word, and so on. Even after removing stop words, all words are subject to Zipf's law.
In this sense, imagine you have 5 articles describing a topic of automobiles. In this example the word "auto" will likely appear in all 5 texts, and therefore will not be a unique identifier of a single text. On the other hand, if only one article describes auto "insurance" while another describes auto "mechanics", these two words ("mechanics" and "insurance") will be unique identifiers of their texts. By using the idf, words that appear less commonly across texts ("mechanics" and "insurance", for example) will receive a higher weight. Therefore using an idf does not tackle the bias generated by the length of an article, since it is, again, a global measure. If you want to reduce the bias generated by length then, as you said, using sublinear_tf=True is a way to address it, since you are transforming the local component (each article).
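As a toy illustration of this effect (a made-up five-"article" corpus; get_feature_names() is the accessor in older scikit-learn versions, newer ones use get_feature_names_out()):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "auto insurance policies",
    "auto mechanics and repairs",
    "auto sales figures",
    "auto leasing deals",
    "auto dealerships nearby",
]
vect = TfidfVectorizer(use_idf=True)
vect.fit(docs)
idf = dict(zip(vect.get_feature_names(), vect.idf_))
print(idf['auto'])       # lowest idf: appears in every document
print(idf['insurance'])  # higher idf: appears in only one document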
Hope it helps.