I built a box-embedding model on the latest Wikipedia articles dump and I need to compare it with the word2vec model in Gensim. I noticed that if I generate the corpus data as a txt file using the get_texts() method of the WikiCorpus class, it still contains a lot of stop words, which makes me think that WikiCorpus doesn't remove stop words. Is that right? Now, after training my box model on that wiki-corpus txt, I notice that the "most similar" function I wrote specifically for box embeddings very often prints stop words, while the same word passed to the most_similar function of a word2vec model trained on the same corpus txt produces much better results. Can someone suggest why the word2vec model works so well even though the corpus txt contains a lot of stop words, while my box model on the same corpus does not?
How did you train a box-embedding, and why did you think it would offer good most_similar() results?
From a (very) quick glance at the 'BoxE' paper by Abboud et al (2020), it appears to require training based on a knowledge base representation – not the raw text that'd come from WikiCorpus. (So I'd not necessarily expect a properly-trained BoxE embedding would have 'stop words' in it at all.)
And, BoxE appears to be optimized for evaluating potential facts – not more general most_similar rankings. So I'd not expect a simple most_similar listing from it to necessarily be expressive.
In usual word2vec, removing stop-words isn't very important and plenty of published work doesn't bother doing so. The downsampling of highly-frequent words already tends to ignore many stop-word occurrences – and their highly diverse usage contexts mean they are likely to get weak word-vectors not especially close to other more-narrow-meaning word-vectors.
So in usual word2vec, stop-words aren't very likely to be in the top-neighbors, by cosine-similarity, of other more-specific entity words.
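For illustration, here's a minimal sketch (with a hypothetical corpus file name) of why that tends to happen in Gensim: the default sample setting already downsamples very frequent words during training, so they get little influence and rarely surface as nearest neighbors of content words.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
# 'wiki_corpus.txt' is a hypothetical path: one article per line, tokens separated by spaces
corpus = LineSentence('wiki_corpus.txt')
# sample=1e-3 (the default) aggressively downsamples very frequent words such as stop words
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=5, sample=1e-3, workers=4)
# even with stop words left in the corpus, the neighbors of a content word
# are usually other content words
print(model.wv.most_similar('tourism', topn=10))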
I need to train a model in Python based on word2vec or other models in order to get adjectives which are semantically close to a word.
For example give a word like 'cat' to model and receive adjectives like 'cute', 'nice', etc.
Is there any way?
With any word2vec model – whether you train it on your own data, or download someone else's pre-trained model – you can give it a word like cat and receive back a ranked list of words that are considered 'similar' in its coordinate system.
However, these won't normally be limited to adjectives, as typical word2vec models don't take any note of a word's part-of-speech. So to filter to just adjectives, some options could include:
use a typical word2vec set-of-vectors that is oblivious to part-of-speech, but use some external reference (like, say, WordNet) to check each returned word, and discard those that can't be adjectives (see the sketch after this list)
preprocess a suitable training corpus to label words with their part-of-speech before word2vec training, as is sometimes done. Then your model's tokens will include within them a declared part-of-speech. For example, you'd then no longer have the word good alone as a token, but (depending on what conventions you use) tagged-tokens like good/NOUN & good/ADJ instead. Then, filtering the closest-words to just adjectives is a simple matter of checking for the desired string pattern.
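For example, here's a rough sketch of the first option, filtering most_similar() results through NLTK's WordNet interface. It assumes you've already trained or loaded a Gensim Word2Vec model named model and downloaded the NLTK wordnet data; the helper name is just for illustration.
from nltk.corpus import wordnet as wn
def similar_adjectives(model, word, topn=10, candidates=200):
    """Return words near `word` in the model that WordNet lists as (also) adjectives."""
    adjectives = []
    for neighbor, score in model.wv.most_similar(word, topn=candidates):
        # keep the neighbor only if WordNet has an adjective sense for it
        if wn.synsets(neighbor, pos=wn.ADJ):
            adjectives.append((neighbor, score))
        if len(adjectives) >= topn:
            break
    return adjectives
# hypothetical usage, with a Word2Vec model trained (or loaded) elsewhere:
# print(similar_adjectives(model, 'cat'))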
However, the words you receive from any such process based on word2vec might not be precisely what you're looking for. The kinds of 'semantic similarity' captured by word2vec coordinates are driven by how well words predict other nearby words under the model's limitations. Whether these will meet your needs is something you'll have to try; there could be surprises.
For example, words that humans consider antonyms, like hot & cold, will still be relatively close to each other in word2vec models, as they both describe the same aspect of something (its temperature), and often appear in the same surrounding-word contexts.
And, depending on training texts & model training parameters, different word2vec models can sometimes emphasize different kinds of similarity in their rankings. Some have suggested, for example, that using a smaller window can tend to place words that are direct replacements for each other (same syntactic roles) closer together, whereas a larger window will somewhat more bring together words used in the same topical domains (even if they aren't of the same type). Which kind of similarity would be better for your need? I'm not sure; if you have the time/resources, you could compare the quality of results from multiple contrasting models.
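If you do want to run that comparison, a rough sketch (with a hypothetical tokenized corpus prepared elsewhere) might look like:
from gensim.models import Word2Vec
# corpus: a hypothetical iterable of token lists, prepared elsewhere
narrow = Word2Vec(sentences=corpus, vector_size=100, window=2, min_count=5, workers=4)
wide = Word2Vec(sentences=corpus, vector_size=100, window=10, min_count=5, workers=4)
# compare which kind of 'similar' words each model surfaces
print(narrow.wv.most_similar('cat', topn=10))   # smaller window: often more drop-in replacements
print(wide.wv.most_similar('cat', topn=10))     # larger window: often more same-topic words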
First, I want to explain my task. I have a dataset of 300k documents with an average of 560 words each (no stop-word removal yet); 75% are in German, 15% in English, and the rest in various other languages. The goal is to recommend similar documents based on an existing one. At the beginning I want to focus on the German and English documents.
To achieve this goal I looked into several methods of feature extraction for document similarity; the word-embedding methods in particular impressed me, because they are context-aware, in contrast to simple TF-IDF feature extraction plus cosine similarity.
I'm overwhelmed by the number of methods I could use, and I haven't found a proper evaluation of them yet. I know for sure that my documents are too long for BERT, but there is FastText, Sent2Vec, Doc2Vec and the Universal Sentence Encoder from Google. My favorite method based on my research is Doc2Vec, even though there are no (or only outdated) pre-trained models, which means I have to do the training on my own.
Now that you know my task and goal, I have the following questions:
Which method should I use for feature extraction based on the rough overview of my data?
My dataset is too small to train Doc2Vec on it. Do I achieve good results if I train the model on English / German Wikipedia?
You really have to try the different methods on your data, with your specific user tasks, with your time/resources budget to know which makes sense.
Your 225k German documents and 45k English documents are each plausibly large enough to use Doc2Vec, as they match or exceed the corpus sizes of some published results. So you wouldn't necessarily need to add training on something else (like Wikipedia) instead, and whether adding that to your data would help or hurt is another thing you'd need to determine experimentally.
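If you go that route, a minimal Doc2Vec sketch in Gensim could look like the following. The docs variable is a hypothetical list of already-tokenized documents, and the parameters are only illustrative; they'd need tuning against your data and evaluation task.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# docs: a hypothetical list of tokenized documents (lists of word tokens)
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(docs)]
model = Doc2Vec(vector_size=300, window=5, min_count=5, epochs=20, workers=4)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)
# most similar stored documents to document 0
print(model.dv.most_similar(0, topn=10))
# or infer a vector for a new, unseen document
new_vec = model.infer_vector(['ein', 'neues', 'dokument'])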
(There might be special challenges in German given compound words using common-enough roots but being individually rare, I'm not sure. FastText-based approaches that use word-fragments might be helpful, but I don't know a Doc2Vec-like algorithm that necessarily uses that same char-ngrams trick. The closest that might be possible is to use Facebook FastText's supervised mode, with a rich set of meaningful known-labels to bootstrap better text vectors - but that's highly speculative and that mode isn't supported in Gensim.)
Is there a pretrained Gensim's Phrases model? If not, would it be possible to reverse engineer and create a phrase model using a pretrained word embedding?
I am trying to use GoogleNews-vectors-negative300.bin with Gensim's Word2Vec. First, I need to map my words into phrases so that I can look up their vectors from the Google's pretrained embedding.
I search on the official Gensim's documentation but could not find any info. Thanks!
I'm not aware of anyone sharing a Phrases model. Any such model would be very sensitive to the preprocessing/tokenization step and the specific parameters its creator used.
Other than the high-level algorithm description, I haven't seen Google's exact choices for the tokenization/canonicalization/phrase-combination applied to the data that fed into the GoogleNews 2013 word-vectors documented anywhere. Some guesses about the preprocessing can be made by reviewing the tokens present, but I'm unaware of any code to apply similar choices to other text.
You could try to mimic their unigram tokenization, then speculatively combine strings of unigrams into ever-longer multigrams up to some maximum, check if those combinations are present, and when not present, revert to the unigrams (or largest combination present). This might be expensive if done naively, but be amenable to optimizations if really important - especially for some subset of the more-frequent words – as the GoogleNews set appears to obey the convention of listing words in descending frequency.
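A rough sketch of that speculative approach, assuming the GoogleNews vectors are loaded into a Gensim KeyedVectors object; the greedy longest-match logic below is just one possible convention, not Google's actual procedure.
from gensim.models import KeyedVectors
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
def combine_tokens(tokens, kv, max_len=4):
    """Greedily join consecutive unigrams with '_' into the longest phrase present in kv."""
    out, i = [], 0
    while i < len(tokens):
        match, length = tokens[i], 1
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = '_'.join(tokens[i:i + n])
            if candidate in kv:           # phrase tokens in GoogleNews use '_' separators
                match, length = candidate, n
                break
        out.append(match)
        i += length
    return out
print(combine_tokens(['New', 'York', 'City', 'mayor'], kv))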
(In general, though it's a quick & easy starting set of word-vectors, I think GoogleNews is a bit over-relied upon. It will lack words/phrases and new senses that have developed since 2013, and any meanings it does capture are determined by news articles in the years leading up to 2013... which may not match the dominant senses of words in other domains. If your domain isn't specifically news, and you have sufficient data, deciding your own domain-specific tokenization/combination will likely perform better.)
I'd like to compare the difference among the same word mentioned in different sentences, for example "travel".
What I would like to do is:
Take the sentences mentioning the term "travel" as plain text;
In each sentence, replace 'travel' with travel_sent_x.
Train a word2vec model on these sentences.
Calculate the distance between travel_sent1, travel_sent2, and other relabelled mentions of "travel"
So each sentence's "travel" gets its own vector, which is used for comparison.
I know that word2vec requires much more than several sentences to train reliable vectors. The official page recommends datasets with billions of words, but I have nothing near that number in my dataset (I have thousands of words).
I was trying to test the model with the following few sentences:
Sentences
Hawaii makes a move to boost domestic travel and support local tourism
Honolulu makes a move to boost travel and support local tourism
Hawaii wants tourists to return so much it's offering to pay for half of their travel expenses
My approach to build the vectors has been:
from gensim.models import Word2Vec
# tokenize each sentence into a list of word tokens
vocab = [s.lower().split() for s in df['Sentences']]
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=3, workers=4, sg=0)
# (in Gensim 4.0+ the 'size' parameter is named 'vector_size')
# look up the learned vector for an individual word, for example:
vector = model.wv['travel']
However I do not know how to visualise the results to see their similarity and get some useful insight.
Any help and advice will be welcome.
Update: I would use the Principal Component Analysis algorithm to visualise embeddings in 3-dimensional space. I know how to do it for each individual word, but I do not know how to do it in the case of sentences.
Note that word2vec is not inherently a method for modeling sentences, only words. So there's no single, official way to use word2vec to represent sentences.
One quick & crude approach is to create a vector for a sentence (or other multi-word text) by averaging all its word-vectors together. It's fast, it's better than nothing, and does OK on some simple (broadly topical) tasks - but it isn't going to capture the full meaning of a text very well, especially any meaning which is dependent on grammar, polysemy, or sophisticated contextual hints.
Still, you could use it to get a fixed-size vector per short text, and calculate pairwise similarities/distances between those vectors, and feed the results into dimensionality-reduction algorithms for visualization or other purposes.
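For example, a minimal sketch of that averaging approach, assuming a trained Gensim Word2Vec model (model) and tokenized sentences (vocab) like those in the question; the helper name is just for illustration.
import numpy as np
def sentence_vector(tokens, model):
    """Average the vectors of the tokens the model knows; zero vector if none are known."""
    known = [model.wv[t] for t in tokens if t in model.wv]
    if not known:
        return np.zeros(model.wv.vector_size)
    return np.mean(known, axis=0)
sent_vecs = [sentence_vector(tokens, model) for tokens in vocab]
# cosine similarity between the first two sentences
a, b = sent_vecs[0], sent_vecs[1]
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))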
Other algorithms actually create vectors for longer texts. A shallow algorithm very closely related to word2vec is 'paragraph vectors', available in Gensim as the Doc2Vec class. But it's still not very sophisticated, and still not grammar-aware. A number of deeper-network text models like BERT, ELMo, & others may be possibilities.
Word2vec & related algorithms are very data-hungry: all of their beneficial qualities arise from the tug-of-war between many varied usage examples for the same word. So if you have a toy-sized dataset, you won't get a set of vectors with useful interrelationships.
But also, rare words in your larger dataset won't get good vectors. It is typical in training to discard, as if they weren't even there, words that appear below some min_count frequency - because not only would their vectors be poor, from just one or a few idiosyncratic sample uses, but because there are many such underrepresented words in total, keeping them around tends to make other word-vectors worse, too. They're noise.
So, your proposed idea of taking individual instances of travel & replacing them with single-appearance tokens is not very likely to give interesting results. Lowering your min_count to 1 will get you vectors for each variant - but they'll be of far worse (& more-random) quality than your other word-vectors, having received comparatively little training attention compared to other words, and each being fully influenced by just their few surrounding words (rather than the entire range of all surrounding contexts that could all help contribute to the useful positioning of a unified travel token).
(You might be able to offset these problems, a little, by (1) retaining the original version of the sentence, so you still get a travel vector; (2) repeating your token-mangled sentences several times, & shuffling them to appear throughout the corpus, to somewhat simulate more real occurrences of your synthetic contexts. But without real variety, most of the problems of such single-context vectors will remain.)
Another possible way to compare travel_sent_A, travel_sent_B, etc. would be to ignore the exact vector for travel or travel_sent_X entirely, and instead compile a summary vector from the word's surrounding N words. For example, if you have 100 examples of the word travel, create 100 vectors, each built from the N words around one occurrence of travel. These vectors might show some vague clusters/neighborhoods, especially in the case of a word with very different alternate meanings. (Some research adapting word2vec to account for polysemy uses this sort of context-vector approach to influence/choose among alternate word-senses.)
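A rough sketch of that context-vector idea, again assuming a trained Word2Vec model and hypothetical tokenized sentences (sentences); the helper name is just for illustration.
import numpy as np
def context_vector(tokens, index, model, n=5):
    """Average the vectors of up to n known words on each side of tokens[index]."""
    window = tokens[max(0, index - n):index] + tokens[index + 1:index + 1 + n]
    known = [model.wv[t] for t in window if t in model.wv]
    return np.mean(known, axis=0) if known else np.zeros(model.wv.vector_size)
# one context vector per occurrence of 'travel' across all tokenized sentences
travel_contexts = [context_vector(tokens, i, model)
                   for tokens in sentences
                   for i, t in enumerate(tokens) if t == 'travel']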
You might also find this research on modeling words as drawing from alternate 'atoms' of discourse interesting: Linear algebraic structure of word meanings
To the extent you have short headline-like texts, and only word-vectors (without the data or algorithms to do deeper modeling), you may also want to look into the "Word Mover's Distance" calculation for comparing texts. Rather than reducing a single text to a single vector, it models it as a "bag of word-vectors". Then, it defines a distance as a cost-to-transform one bag to another bag. (More similar words are easier to transform into each other than less-similar words, so expressions that are very similar, with just a few synonyms replaced, report as quite close.)
It can be quite expensive to calculate on longer texts, but may work well for short phrases and small sets of headlines/tweets/etc. It's available on the Gensim KeyedVectors class as wmdistance(). An example of the kinds of correlations it may be useful in discovering is in this article: Navigating themes in restaurant reviews with Word Mover's Distance
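A minimal sketch of using wmdistance() on two of the example headlines; it needs a set of trained word-vectors, and depending on the Gensim version it may also require the pyemd or POT package to be installed.
doc1 = "hawaii makes a move to boost domestic travel".split()
doc2 = "honolulu makes a move to boost travel".split()
# 'model' is a trained Word2Vec model (or any KeyedVectors); lower distance = more similar
print(model.wv.wmdistance(doc1, doc2))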
If you are interested in comparing sentences, Word2Vec is not the best choice. It has been shown that using it to create sentence embeddings produces inferior results compared to dedicated sentence-embedding algorithms. If your dataset is not huge, you can't create (train) a new embedding space from your own data. This forces you to use a pre-trained embedding for the sentences. Luckily, there are enough of those nowadays. I believe that the Universal Sentence Encoder (by Google) will suit your needs best.
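For example, a minimal sketch of getting sentence vectors from the Universal Sentence Encoder via TensorFlow Hub (assumes tensorflow and tensorflow_hub are installed; the sentences are taken from the question):
import numpy as np
import tensorflow_hub as hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
sentences = [
    "Hawaii makes a move to boost domestic travel and support local tourism",
    "Honolulu makes a move to boost travel and support local tourism",
]
vectors = np.array(embed(sentences))  # one 512-dimensional vector per sentence
# cosine similarity between the two sentences
print(np.dot(vectors[0], vectors[1]) / (np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1])))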
Once you get a vector representation for your sentences you can go two ways:
create a matrix of pairwise comparisons and visualize it as a heatmap. This representation is useful when you have some prior knowledge about how close the sentences are and you want to check your hypothesis. You can even try it online.
run t-SNE on the vector representations. This will create a 2D projection of the sentences that preserves the relative distances between them. It presents the data much better than PCA. Then you can easily find the neighbors of a given sentence (see the sketch just below):
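A minimal sketch of that t-SNE option with scikit-learn, assuming an array of sentence vectors such as the vectors from the Universal Sentence Encoder sketch above:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# vectors: one embedding per sentence; perplexity must be smaller than the number of sentences
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1])
for i, (x, y) in enumerate(coords):
    plt.annotate(str(i), (x, y))   # label each point with its sentence index
plt.show()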
You can learn more from this and this
Interesting take on the word2vec model. You can compute t-SNE embeddings of the vectors, reduce the dimensionality to 3, and visualise them using any plotting library such as matplotlib or Dash. I also find this tool helpful when visualising word embeddings: https://projector.tensorflow.org/
The idea of learning different word embeddings for words in different contexts is the premise of ELMo (https://allennlp.org/elmo), but you will require a huge training set to train it. Luckily, if your application is not very specific, you can use pre-trained models.
I am interested in identifying the WordNet synset IDs for each word in a set of tags.
The words in the set provide the context for the word sense disambiguation, such as:
{mole, skin}
{mole, grass, fur}
{mole, chemistry}
{bank, river, river bank}
{bank, money, building}
I know of the Lesk algorithm and libraries such as pywsd, which is based on 10+ year-old tech (which may still be cutting edge - that is my question).
Are there better performing algorithms by now that make sense of pre-trained embeddings, like GloVe, and maybe the distances of these embeddings to each other?
Are there ready-to-use implementations of such WSD algorithms?
I know this question is close to the danger zone of asking for subjective preferences - as in this 5-year old thread. But I am not asking for an overview of options or the best software for a problem.
Transfer learning, particularly through models like Allen AI's ELMo, OpenAI's GPT, and Google's BERT, allowed researchers to smash multiple benchmarks with minimal task-specific fine-tuning, and provided the rest of the NLP community with pretrained models that could easily (with less data and less compute time) be fine-tuned and implemented to produce state-of-the-art results.
These representations will help you accurately retrieve results matching the customer's intent and contextual meaning, even if there's no keyword or phrase overlap.
To start off, embeddings are simply (moderately) low dimensional representations of a point in a higher dimensional vector space.
By translating a word to an embedding it becomes possible to model the semantic importance of a word in a numeric form and thus perform mathematical operations on it.
When the word2vec model first made this possible, it was an amazing breakthrough. From there, many more advanced models surfaced, capturing not only a static semantic meaning but also a contextualized meaning. For instance, consider the two sentences below:
I like apples.
I like Apple MacBooks.
Note that the word apple has a different semantic meaning in each sentence. With a contextualized language model, the embedding of the word apple gets a different vector representation in each sentence, which makes such models even more powerful for NLP tasks.
Contextual embeddings like BERT offer an advantage over models like Word2Vec: while each word has a fixed representation under Word2Vec regardless of the context in which it appears, BERT produces word representations that are dynamically informed by the words around them.
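As a rough illustration, here is a minimal sketch using the Hugging Face transformers library (one of several ways to get BERT embeddings); the sentences are chosen so that 'apple' stays a single token, and the helper name is just for illustration.
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
def apple_vector(sentence):
    """Return the contextual embedding of the 'apple' token in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # shape: (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("apple")]
v1 = apple_vector("I like apple pie.")
v2 = apple_vector("I bought an apple laptop.")
# the two 'apple' vectors differ because BERT conditions on the surrounding words
print(torch.cosine_similarity(v1, v2, dim=0).item())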