Sentence embeddings using word2vec - Python

I'd like to compare the difference among the same word mentioned in different sentences, for example "travel".
What I would like to do is:
Take the sentences mentioning the term "travel" as plain text;
In each sentence, replace 'travel' with travel_sent_x.
Train a word2vec model on these sentences.
Calculate the distance between travel_sent1, travel_sent2, and other relabelled mentions of "travel"
So each sentence's "travel" gets its own vector, which is used for comparison.
I know that word2vec requires much more than several sentences to train reliable vectors. The official page recommends datasets of billions of words, but I do not have anywhere near that many in my dataset (I have thousands of words).
I was trying to test the model with the following few sentences:
Sentences
Hawaii makes a move to boost domestic travel and support local tourism
Honolulu makes a move to boost travel and support local tourism
Hawaii wants tourists to return so much it's offering to pay for half of their travel expenses
My approach to building the vectors has been:
from gensim.models import Word2Vec
# tokenize each sentence into a list of words
vocab = [sentence.split() for sentence in df['Sentences']]
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=3, workers=4, sg=0)
# intended to produce one vector per sentence, but Word2Vec has no vectorize() method, so this line fails
df['Sentences'].apply(model.vectorize)
However I do not know how to visualise the results to see their similarity and get some useful insight.
Any help and advice will be welcome.
Update: I would use the Principal Component Analysis (PCA) algorithm to visualise the embeddings in 3-dimensional space. I know how to do it for each individual word, but I do not know how to do it for sentences.

Note that word2vec is not inherently a method for modeling sentences, only words. So there's no single, official way to use word2vec to represent sentences.
One quick & crude approach is to create a vector for a sentence (or other multi-word text) by averaging all of its word-vectors together. It's fast, it's better than nothing, and does OK on some simple (broadly topical) tasks - but it isn't going to capture the full meaning of a text very well, especially any meaning which is dependent on grammar, polysemy, or sophisticated contextual hints.
Still, you could use it to get a fixed-size vector per short text, and calculate pairwise similarities/distances between those vectors, and feed the results into dimensionality-reduction algorithms for visualization or other purposes.
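As a rough sketch of that averaging approach (reusing the model and tokenized vocab names from the question's snippet; with such a tiny corpus the numbers won't mean much):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

def sentence_vector(tokens, wv):
    # average the vectors of the in-vocabulary words; zero vector if none are known
    vecs = [wv[w] for w in tokens if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

sent_vecs = np.array([sentence_vector(tokens, model.wv) for tokens in vocab])
sims = cosine_similarity(sent_vecs)                    # pairwise sentence similarities
coords = PCA(n_components=3).fit_transform(sent_vecs)  # 3-D coordinates for plotting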
Other algorithms actually create vectors for longer texts. A shallow algorithm very closely related to word2vec is 'paragraph vectors', available in Gensim as the Doc2Vec class. But it's still not very sophisticated, and still not grammar-aware. A number of deeper-network text models like BERT, ELMo, & others may be possibilities.
Word2vec & related algorithms are very data-hungry: all of their beneficial qualities arise from the tug-of-war between many varied usage examples for the same word. So if you have a toy-sized dataset, you won't get a set of vectors with useful interrelationships.
But also, rare words in your larger dataset won't get good vectors. It is typical in training to discard, as if they weren't even there, words that appear below some min_count frequency - because not only would their vectors be poor, from just one or a few idiosyncratic sample uses, but because there are many such underrepresented words in total, keeping them around tends to make other word-vectors worse, too. They're noise.
So, your proposed idea of taking individual instances of travel & replacing them with single-appearance tokens is not very likely to give interesting results. Lowering your min_count to 1 will get you vectors for each variant - but they'll be of far worse (& more-random) quality than your other word-vectors, having received comparatively little training attention, and each being fully influenced by just their few surrounding words (rather than the entire range of surrounding contexts that could all help contribute to the useful positioning of a unified travel token).
(You might be able to offset these problems, a little, by (1) retaining the original version of the sentence, so you still get a travel vector; (2) repeating your token-mangled sentences several times, & shuffling them to appear throughout the corpus, to somewhat simulate more real occurrences of your synthetic contexts. But without real variety, most of the problems of such single-context vectors will remain.)
Another possible way to compare travel_sent_A, travel_sent_B, etc would be to ignore the exact vector for travel or travel_sent_X entirely, but instead compile a summary vector for the word's surrounding N words. For example if you have 100 examples of the word travel, create 100 vectors that are each of the N words around travel. These vectors might show some vague clusters/neighborhoods, especially in the case of a word with very-different alternate meanings. (Some research adapting word2vec to account for polysemy uses this sort of context vector approach to influence/choose among alternate word-senses.)
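For instance, a rough sketch of such context vectors around each occurrence of travel (again reusing model and vocab from the question; the window size n is arbitrary):
import numpy as np

def context_vectors(sentences, wv, target='travel', n=5):
    # one vector per occurrence of `target`: the average of the n words on either side
    out = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == target:
                window = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
                vecs = [wv[w] for w in window if w in wv]
                if vecs:
                    out.append(np.mean(vecs, axis=0))
    return np.array(out)

travel_contexts = context_vectors(vocab, model.wv)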
You might also find this research on modeling words as drawing from alternate 'atoms' of discourse interesting: Linear algebraic structure of word meanings
To the extent you have short headline-like texts, and only word-vectors (without the data or algorithms to do deeper modeling), you may also want to look into the "Word Mover's Distance" calculation for comparing texts. Rather than reducing a single text to a single vector, it models it as a "bag of word-vectors". Then, it defines a distance as a cost-to-transform one bag to another bag. (More similar words are easier to transform into each other than less-similar words, so expressions that are very similar, with just a few synonyms replaced, report as quite close.)
It can be quite expensive to calculate on longer texts, but may work well for short phrases and small sets of headlines/tweets/etc. It's available on the Gensim KeyedVectors classes as wmdistance(). An example of the kinds of correlations it may be useful in discovering is in this article: Navigating themes in restaurant reviews with Word Mover’s Distance
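A minimal sketch of that calculation, reusing the model and tokenized vocab from the question's snippet (depending on your gensim version, wmdistance() may need the pyemd or POT package installed):
# Word Mover's Distance between the first two tokenized sentences (lower = more similar)
distance = model.wv.wmdistance(vocab[0], vocab[1])
print(distance)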

If you are interested in comparing sentences, Word2Vec is not the best choice. It has been shown that using it to create sentence embeddings produces inferior results compared to a dedicated sentence embedding algorithm. If your dataset is not huge, you can't create (train) a new embedding space using your own data. This forces you to use a pre-trained embedding for the sentences. Luckily, there are enough of those nowadays. I believe that the Universal Sentence Encoder (by Google) will suit your needs best.
Once you get a vector representation for your sentences, you can go two ways:
Create a matrix of pairwise comparisons and visualize it as a heatmap. This representation is useful when you have some prior knowledge about how close the sentences are and you want to check your hypothesis. You can even try it online; a rough sketch of this approach follows below.
Run t-SNE on the vector representations. This will create a 2D projection of the sentences that preserves the relative distances between them, and it presents the data much better than PCA. Then you can easily find the neighbours of a certain sentence.
You can learn more from this and this
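A rough sketch of the heatmap idea with the Universal Sentence Encoder via tensorflow_hub (the module URL and details here are assumptions, not part of the original answer):
import numpy as np
import matplotlib.pyplot as plt
import tensorflow_hub as hub

sentences = list(df['Sentences'])   # plain-text sentences, as in the question

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = np.array(embed(sentences))

# USE vectors are approximately unit-length, so the inner product approximates cosine similarity
sims = np.inner(vectors, vectors)
plt.imshow(sims, cmap='viridis')
plt.colorbar()
plt.show()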

Interesting take on the word2vec model. You can use t-SNE to reduce the dimensionality of the vectors to 3 and visualise them using any plotting library such as matplotlib or Dash. I also find this tool helpful when visualising word embeddings: https://projector.tensorflow.org/
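For example, a rough sketch with scikit-learn and matplotlib, assuming vectors is a NumPy array of sentence (or word) embeddings (perplexity must stay below the number of points, so it is tiny here):
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# reduce the embeddings to 3 dimensions for plotting
coords = TSNE(n_components=3, perplexity=2, random_state=0).fit_transform(vectors)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2])
plt.show()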
The idea of learning different word embeddings for words in different contexts is the premise of ELMo (https://allennlp.org/elmo), but you will require a huge training set to train it. Luckily, if your application is not very specific, you can use pre-trained models.

Related

Which document embedding model for document similarity

First, I want to explain my task. I have a dataset of 300k documents with an average of 560 words each (no stop word removal yet): 75% in German, 15% in English, and the rest in other languages. The goal is to recommend similar documents based on an existing one. At the beginning I want to focus on the German and English documents.
To achieve this goal I looked into several methods of feature extraction for document similarity; the word embedding methods especially have impressed me, because they are context aware, in contrast to simple TF-IDF feature extraction and the calculation of cosine similarity.
I'm overwhelmed by the number of methods I could use, and I haven't found a proper evaluation of those methods yet. I know for sure that my documents are too long for BERT, but there are FastText, Sent2Vec, Doc2Vec and the Universal Sentence Encoder from Google. My favorite method based on my research is Doc2Vec, even though there are no pre-trained models (or only old ones), which means I have to do the training on my own.
Now that you know my task and goal, I have the following questions:
Which method should I use for feature extraction based on the rough overview of my data?
My dataset is too small to train Doc2Vec on it. Would I achieve good results if I train the model on the English / German Wikipedia?
You really have to try the different methods on your data, with your specific user tasks, with your time/resources budget to know which makes sense.
Your 225k German documents and 45k English documents are each plausibly large enough to use Doc2Vec - as they match or exceed some published results. So you wouldn't necessarily need to add training on something else (like Wikipedia), and whether adding that to your data would help or hurt is another thing you'd need to determine experimentally.
(There might be special challenges in German given compound words using common-enough roots but being individually rare, I'm not sure. FastText-based approaches that use word-fragments might be helpful, but I don't know a Doc2Vec-like algorithm that necessarily uses that same char-ngrams trick. The closest that might be possible is to use Facebook FastText's supervised mode, with a rich set of meaningful known-labels to bootstrap better text vectors - but that's highly speculative and that mode isn't supported in Gensim.)
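If you do try Doc2Vec on your own documents, a minimal gensim training sketch (gensim 4 naming; docs is an assumed list of token lists, one per document, and the hyperparameters are only placeholders):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=tokens, tags=[str(i)]) for i, tokens in enumerate(docs)]

model = Doc2Vec(vector_size=300, min_count=5, epochs=20, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# 10 most similar documents to the first one (older gensim versions use model.docvecs)
print(model.dv.most_similar('0', topn=10))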

Word-sense disambiguation based on sets of words using pre-trained embeddings

I am interested in identifying the WordNet synset IDs for each word in a set of tags.
The words in the set provide the context for the word sense disambiguation, such as:
{mole, skin}
{mole, grass, fur}
{mole, chemistry}
{bank, river, river bank}
{bank, money, building}
I know of the Lesk algorithm and libraries such as pywsd, which are based on 10+ year old tech (which may still be cutting edge -- that is my question).
Are there better-performing algorithms by now that make use of pre-trained embeddings, like GloVe, and maybe the distances of these embeddings to each other?
Are there ready-to-use implementations of such WSD algorithms?
I know this question is close to the danger zone of asking for subjective preferences - as in this 5-year old thread. But I am not asking for an overview of options or the best software for a problem.
Transfer learning, particularly models like Allen AI’s ELMo, OpenAI’s Open-GPT, and Google’s BERT, allowed researchers to smash multiple benchmarks with minimal task-specific fine-tuning and provided the rest of the NLP community with pretrained models that could easily (with less data and less compute time) be fine-tuned and implemented to produce state-of-the-art results.
These representations will help you accurately retrieve results matching the customer's intent and contextual meaning, even if there's no keyword or phrase overlap.
To start off, embeddings are simply (moderately) low dimensional representations of a point in a higher dimensional vector space.
By translating a word to an embedding it becomes possible to model the semantic importance of a word in a numeric form and thus perform mathematical operations on it.
When the word2vec model first made this possible, it was an amazing breakthrough. From there, many more advanced models surfaced which not only captured a static semantic meaning but also a contextualized meaning. For instance, consider the two sentences below:
I like apples.
I like Apple MacBooks.
Note that the word apple has a different semantic meaning in each sentence. With a contextualized language model, the embedding of the word apple would have a different vector representation in each sentence, which makes such models even more powerful for NLP tasks.
Contextual embeddings like BERT offer an advantage over models like Word2Vec: while each word has a fixed representation under Word2Vec regardless of the context within which it appears, BERT produces word representations that are dynamically informed by the words around them.

How do I calculate the similarity of a word or couple of words compared to a document using a doc2vec model?

In gensim I have a trained doc2vec model, if I have a document and either a single word or two-three words, what would be the best way to calculate the similarity of the words to the document?
Do I just do the standard cosine similarity between them as if they were 2 documents? Or is there a better approach for comparing small strings to documents?
On first thought I could get the cosine similarity between each word in the 1-3 word string and every word in the document and take the average, but I don't know how effective this would be.
There's a number of possible approaches, and what's best will likely depend on the kind/quality of your training data and ultimate goals.
With any Doc2Vec model, you can infer a vector for a new text that contains known words – even a single-word text – via the infer_vector() method. However, like Doc2Vec in general, this tends to work better with documents of at least dozens, and preferably hundreds, of words. (Tiny 1-3 word documents seem especially likely to get somewhat peculiar/extreme inferred-vectors, especially if the model/training-data was underpowered to begin with.)
Beware that unknown words are ignored by infer_vector(), so if you feed it a 3-word document for which two words are unknown, it's really just inferring based on the one known word. And if you feed it only unknown words, it will return a random, mild initialization vector that's undergone no inference tuning. (All inference/training always starts with such a random vector, and if there are no known words, you just get that back.)
Still, this may be worth trying, and you can directly compare via cosine-similarity the inferred vectors from tiny and giant documents alike.
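For instance, a rough sketch (gensim 4 naming; long_doc_tokens is a hypothetical tokenized document):
import numpy as np

query_vec = model.infer_vector(['machine', 'learning'])   # tiny 2-word "document"
doc_vec = model.infer_vector(long_doc_tokens)              # a full tokenized document

# cosine similarity between the two inferred vectors
cos = np.dot(query_vec, doc_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
print(cos)

# or rank all training documents against the tiny query
print(model.dv.most_similar([query_vec], topn=10))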
Many Doc2Vec modes train both doc-vectors and compatible word-vectors. The default PV-DM mode (dm=1) does this, or PV-DBOW (dm=0) if you add the optional interleaved word-vector training (dbow_words=1). (If you use dm=0, dbow_words=0, you'll get fast training, and often quite-good doc-vectors, but the word-vectors won't have been trained at all - so you wouldn't want to look up such a model's word-vectors directly for any purposes.)
With such a Doc2Vec model that includes valid word-vectors, you could also analyze your short 1-3 word docs via their individual words' vectors. You might check each word individually against a full document's vector, or use the average of the short document's words against a full document's vector.
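As a rough illustration of both options, assuming a model trained in a mode that also learns word-vectors, and hypothetical names short_tokens and doc_tag:
import numpy as np

short_tokens = ['solar', 'panel', 'efficiency']   # hypothetical 1-3 word query
doc_vec = model.dv[doc_tag]                       # vector of one full training document

# Option 1: each query word individually against the full document's vector
for w in short_tokens:
    if w in model.wv:
        print(w, model.wv.cosine_similarities(model.wv[w], np.array([doc_vec]))[0])

# Option 2: the average of the query words' vectors against the document's vector
known = [model.wv[w] for w in short_tokens if w in model.wv]
if known:
    avg = np.mean(known, axis=0)
    print(model.wv.cosine_similarities(avg, np.array([doc_vec]))[0])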
Again, which is best will likely depend on other particulars of your need. For example, if the short doc is a query, and you're listing multiple results, it may be the case that query result variety – via showing some hits that are really close to single words in the query, even when not close to the full query – is as valuable to users as documents close to the full query.
Another measure worth looking at is "Word Mover's Distance", which works just with the word-vectors for a text's words, as if they were "piles of meaning" for longer texts. It's a bit like the word-against-every-word approach you entertained – but working to match words with their nearest analogues in a comparison text. It can be quite expensive to calculate (especially on longer texts) – but can sometimes give impressive results in correlating alternate texts that use varied words to similar effect.

NER(Named Entity Recognition) Similarity between sentences in documents

I have been using spaCy to find the NER of sentences. My problem is that I have to calculate the NER similarity between sentences of two different documents. Is there any formula or package available in Python for this?
TIA
I believe you are asking, how similar are two named entities?
This is not so trivial, since we have to define what "similar" means.
If we use a naive bag-of-words approach, two entities are more similar when more of their tokens are identical.
If we put the entity tokens into sets, the calculation would just be the jaccard coefficient.
Sim(ent1, ent2) = |ent1 ∩ ent2| / |ent1 ∪ ent2|
Which in python would be:
# build a set of token strings for each spaCy entity span
ent1 = set(map(str, spacy_entity1))
ent2 = set(map(str, spacy_entity2))
# Jaccard coefficient: size of the intersection over size of the union
similarity = len(ent1 & ent2) / len(ent1 | ent2)
Here spacy_entity1 and spacy_entity2 are entities extracted by spaCy. We then just create the entity sets by taking the set of strings that represent their tokens.
I've been tackling the same problem, and here's how I'm solving it.
Since you've given no context about your problem, I'll keep the solution as general as possible.
Links to tools:
spaCy: spaCy similarity
Flair NLP: Flair NLP zero-shot / few-shot classification
Roam Research for research articles and their approach: Roam Research home page
DISCLAIMER: THIS IS A TRIAL EXPERIMENT, NOT THE SOLUTION.
Steps:
Try a large language model with a similarity coefficient (like GPT-3 or TARS) on your dataset.
(Instead of similarity you could also use zero-shot/few-shot classification and then manually inspect the accuracy, or compute it if you have labelled data.)
I then grab the verbs (assuming you have a corpus and not single-word inputs) and calculate their similarity (attaching the x-n and x+n words [excluding stopwords according to your domain], where x is the position of the verb).
This step mainly gives more context to the large language model so it is not biased by its own large training corpus (this is experimental).
Finally, I grab all the named entities and, hopefully, their labels (example, India: Country/Location) and ask the same LLM (large language model), after you've constructed its prompt, to see how many of the same buckets/categories your entities fall into (probability comparison, maybe).
Even if you don't reproduce these steps, what you should understand is that these tools give you atomic information about your raw input; you have to apply mathematical functions to compare them and build your algorithm.
In my case I'm averaging all three similarity indices from the above steps and ensuring that all classification is multi-label classification.
And to validate any of this: human pattern matching, maybe.
You probably need http://uima.apache.org/d/uimacpp-2.4.0/docs/Python.html/ plus a CoNLL-U parser attached to it (https://universaldependencies.org/format.html). With this approach, NERs are matched against a dictionary in a UIMA pipeline. You need to develop some proprietary NER search/match algorithms (in Python or in another supported language).

word2vec - what is best? add, concatenate or average word vectors?

I am working on a recurrent language model. To learn word embeddings that can be used to initialize my language model, I am using gensim's word2vec model.
After training, the word2vec model holds two vectors for each word in the vocabulary: the word embedding (rows of input/hidden matrix) and the context embedding (columns of hidden/output matrix).
As outlined in this post there are at least three common ways to combine these two embedding vectors:
summing the context and word vector for each word
summing & averaging
concatenating the context and word vector
However, I couldn't find proper papers or reports on the best strategy. So my questions are:
Is there a common solution whether to sum, average or concatenate the vectors?
Or does the best way depend entirely on the task in question? If so, what strategy is best for a word-level language model?
Why combine the vectors at all? Why not use the "original" word embeddings for each word, i.e. those contained in the weight matrix between the input and hidden neurons?
Related (but unanswered) questions:
word2vec: Summing/concatenate inside and outside vector
why we use input-hidden weight matrix to be the word vectors instead of hidden-output weight matrix?
I have found an answer in the Stanford lecture "Deep Learning for Natural Language Processing" (Lecture 2, March 2016). It's available here. At minute 46, Richard Socher states that the common way is to average the two word vectors.
You should read this research work at least once to get the whole idea of combining word embeddings using different algebraic operators. It was my research.
In this paper you can also see the other methods to combine word vectors.
In short, L1-normalized average word vectors and the sum of word vectors are good representations.
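A small sketch of those two combinations for one text, given a matrix of its word vectors (this is my reading of the description above, not the paper's exact code):
import numpy as np

def combine(word_vectors):
    word_vectors = np.asarray(word_vectors)
    summed = word_vectors.sum(axis=0)        # sum of word vectors
    avg = word_vectors.mean(axis=0)
    l1_avg = avg / np.abs(avg).sum()         # L1-normalized average
    return summed, l1_avg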
I don't know of any work that empirically tests different ways of combining the two vectors, but there is a highly influential paper comparing 1) using just the word vector, and 2) adding up the word and context vectors. The paper is here: https://www.aclweb.org/anthology/Q15-1016/.
First, note that the metric is analogy and similarity tests, NOT downstream tasks.
Here is a quote from the paper:
for both SGNS and GloVe, it is worthwhile to experiment with the w + c variant [adding up word and context vectors], which is cheap to apply (does not require retraining) and can result in substantial gains (as well as substantial losses).
So I guess you just need to try it out on your specific task.
By the way, here is a post on how to get context vectors from gensim: link
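For reference, a hedged sketch of the w + c combination using gensim's internals (attribute names follow gensim 4, where the output weights live in syn1neg when negative sampling is used; older versions may differ):
import numpy as np

def word_plus_context(model, word):
    # the "input" word vector and the "output"/context vector share the same row index
    idx = model.wv.key_to_index[word]
    w = model.wv.vectors[idx]
    c = model.syn1neg[idx]    # negative-sampling output weights
    return w + c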
I thought I'd attempt to answer based on the comments.
The question you are linking to is: "WordVectors How to concatenate word vectors to form sentence vector"
Word vectors can be compared on their own. But often one wants to put a sentence, paragraph or document in context - i.e. a collection of words - and then the question arises of how to combine those into a single vector (gensim provides doc2vec for that use case).
That doesn't seem to be applicable in your case, so I would just work with the given word vectors. You can adjust parameters like the size of the embedding or the training data, or try other algorithms. You could even combine vectors from different algorithms to create a kind of 'ensemble vector' (e.g. word2vec with GloVe), but it may not perform any better.
Sometimes in language the same word has a different meaning depending on its role within a sentence or on the combination of words it appears in, e.g. 'game' has a different meaning in 'fair game'. Sense2Vec offers a proposal to generate word vectors for such compound words: https://explosion.ai/blog/sense2vec-with-spacy
(Of course, in that case you already need something that understands the sentence structure, such as SpaCy)
