Sentence meaning similarity in Python

I want to calculate sentence meaning similarity. I am using cosine similarity, but this method does not fulfill my needs.
For example, take these two sentences:
He and his father are very close.
He shares a wonderful bond with his father.
What I need is to calculate the similarity between these sentences based on their meaning, not just on matching similar words.
Is there a way to do this?

One approach would be to represent each word using pre-trained word vectors ("embeddings"). These are vectors with a few hundred dimensions, where words with similar meaning (e.g., "close", "bond") have similar vectors. The key idea is that word embeddings can capture that the two sentences have similar meaning even though they use different words.
This can be done quickly with a package such as spaCy in Python. See https://spacy.io/usage/vectors-similarity
Common pre-trained vectors include the Google News word embeddings (https://github.com/mmihaltz/word2vec-GoogleNews-vectors) and GloVe embeddings (https://nlp.stanford.edu/projects/glove/).
Here's a simple approach: represent each word by its pre-trained embedding and average the word vectors across the sentence. Then compare the resulting sentence vectors using any reasonable distance measure (cosine similarity is standard).
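For instance, spaCy exposes this average-and-compare behaviour directly through Doc.similarity(). A minimal sketch, assuming the en_core_web_md model (which ships with word vectors) has been installed via python -m spacy download en_core_web_md:

import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp("He and his father are very close.")
doc2 = nlp("He shares a wonderful bond with his father.")
print(doc1.similarity(doc2))  # cosine similarity of the averaged word vectors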

Related

Sentence embeddings using word2vec

I'd like to compare the difference among the same word mentioned in different sentences, for example "travel".
What I would like to do is:
Take the sentences mentioning the term "travel" as plain text;
In each sentence, replace 'travel' with travel_sent_x.
Train a word2vec model on these sentences.
Calculate the distance between travel_sent1, travel_sent2, and the other relabelled mentions of "travel".
So each sentence's "travel" gets its own vector, which is used for comparison.
I know that word2vec requires much more than several sentences to train reliable vectors. The official page recommends datasets of billions of words, but I have nowhere near that number in my dataset (I have thousands of words).
I was trying to test the model with the following few sentences:
Sentences
Hawaii makes a move to boost domestic travel and support local tourism
Honolulu makes a move to boost travel and support local tourism
Hawaii wants tourists to return so much it's offering to pay for half of their travel expenses
My approach to build the vectors has been:
from gensim.models import Word2Vec
# tokenise each sentence into a list of words
vocab = [s.lower().split() for s in df['Sentences']]
# note: gensim 4.x renames size -> vector_size
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=3, workers=4, sg=0)
# look up the vector of each in-vocabulary word per sentence
vectors = df['Sentences'].apply(lambda s: [model.wv[w] for w in s.lower().split() if w in model.wv])
However, I do not know how to visualise the results to see their similarity and get some useful insight.
Any help and advice will be welcome.
Update: I would like to use the Principal Component Analysis algorithm to visualise the embeddings in 3-dimensional space. I know how to do this for each individual word, but I do not know how to do it for sentences.
Note that word2vec is not inherently a method for modeling sentences, only words. So there's no single, official way to use word2vec to represent sentences.
One quick & crude approach is to create a vector for a sentence (or other multi-word text) by averaging all the word-vectors together. It's fast, it's better than nothing, and does OK on some simple (broadly topical) tasks - but it isn't going to capture the full meaning of a text very well, especially any meaning that depends on grammar, polysemy, or sophisticated contextual hints.
Still, you could use it to get a fixed-size vector per short text, and calculate pairwise similarities/distances between those vectors, and feed the results into dimensionality-reduction algorithms for visualization or other purposes.
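A rough sketch of that route, assuming wv holds pre-trained word vectors as a gensim KeyedVectors object and sentences is a list of token lists (both names are hypothetical):

import numpy as np
from sklearn.decomposition import PCA

def sentence_vector(tokens, wv):
    # average the vectors of the in-vocabulary words; zero vector if none are known
    vecs = [wv[w] for w in tokens if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

sent_vecs = np.array([sentence_vector(t, wv) for t in sentences])
coords = PCA(n_components=3).fit_transform(sent_vecs)  # 3-D points for plotting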
Other algorithms actually create vectors for longer texts. A shallow algorithm very closely related to word2vec is 'paragraph vectors', available in Gensim as the Doc2Vec class. But it's still not very sophisticated, and still not grammar-aware. A number of deeper-network text models like BERT, ELMo, & others may be possibilities.
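For the Doc2Vec route, a hedged sketch with gensim (the parameter values are illustrative only, and a corpus as small as the one in the question will not produce meaningful vectors):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# df['Sentences'] is the question's column of raw sentences
docs = [TaggedDocument(words=s.lower().split(), tags=[i])
        for i, s in enumerate(df['Sentences'])]
model = Doc2Vec(docs, vector_size=100, window=5, min_count=1, epochs=40)

vec = model.infer_vector("hawaii makes a move to boost travel".split())
print(model.dv.most_similar([vec]))  # nearest training sentences (gensim 4.x: .dv, older: .docvecs)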
Word2vec & related algorithms are very data-hungry: all of their beneficial qualities arise from the tug-of-war between many varied usage examples for the same word. So if you have a toy-sized dataset, you won't get a set of vectors with useful interrelationships.
But also, rare words in your larger dataset won't get good vectors. It is typical in training to discard, as if they weren't even there, words that appear below some min_count frequency - because not only would their vectors be poor, from just one or a few idiosyncratic sample uses, but because there are many such underrepresented words in total, keeping them around tends to make other word-vectors worse, too. They're noise.
So, your proposed idea of taking individual instances of travel & replacing them with single-appearance tokens is not very likely to give interesting results. Lowering your min_count to 1 will get you vectors for each variant - but they'll be of far worse (& more random) quality than your other word-vectors, having received comparatively little training attention, and each being fully influenced by just their few surrounding words (rather than the entire range of all surrounding contexts that could help contribute to the useful positioning of a unified travel token).
(You might be able to offset these problems, a little, by (1) retaining the original version of the sentence, so you still get a travel vector; (2) repeating your token-mangled sentences several times, & shuffling them to appear throughout the corpus, to somewhat simulate more real occurrences of your synthetic contexts. But without real variety, most of the problems of such single-context vectors will remain.)
Another possible way to compare travel_sent_A, travel_sent_B, etc. would be to ignore the exact vector for travel or travel_sent_X entirely, and instead compile a summary vector from the word's surrounding N words. For example, if you have 100 examples of the word travel, create 100 vectors that each summarize the N words around one occurrence of travel. These vectors might show some vague clusters/neighborhoods, especially in the case of a word with very different alternate meanings. (Some research adapting word2vec to account for polysemy uses this sort of context vector approach to influence/choose among alternate word-senses.)
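A sketch of that per-occurrence context vector, again assuming hypothetical wv word vectors and tokenized sentences:

import numpy as np

N = 5  # words of context to take on each side
context_vectors = []
for tokens in sentences:
    for i, tok in enumerate(tokens):
        if tok == 'travel':
            window = tokens[max(0, i - N):i] + tokens[i + 1:i + 1 + N]
            vecs = [wv[w] for w in window if w in wv]
            if vecs:
                # one summary vector per occurrence of 'travel'
                context_vectors.append(np.mean(vecs, axis=0))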
You might also find this research on modeling words as drawing from alternate 'atoms' of discourse interesting: Linear algebraic structure of word meanings
To the extent you have short headline-like texts, and only word-vectors (without the data or algorithms to do deeper modeling), you may also want to look into the "Word Mover's Distance" calculation for comparing texts. Rather than reducing a single text to a single vector, it models it as a "bag of word-vectors". Then, it defines a distance as a cost-to-transform one bag to another bag. (More similar words are easier to transform into each other than less-similar words, so expressions that are very similar, with just a few synonyms replaced, report as quite close.)
It can be quite expensive to calculate on longer texts, but may work well for short phrases and small sets of headlines/tweets/etc. It's available on the Gensim KeyedVectors class as wmdistance(). An example of the kinds of correlations it may be useful in discovering is in this article: Navigating themes in restaurant reviews with Word Mover's Distance
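A sketch of wmdistance() on two of the question's headlines, assuming the GoogleNews vectors are on disk and the optional POT/pyemd dependency that WMD needs is installed:

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
s1 = "hawaii makes a move to boost domestic travel".split()
s2 = "honolulu makes a move to boost travel".split()
print(wv.wmdistance(s1, s2))  # smaller distance = more similar texts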
If you are interested in comparing sentences, Word2Vec is not the best choice. It has been shown that using it to create sentence embeddings produces inferior results compared to a dedicated sentence embedding algorithm. If your dataset is not huge, you can't create (train) a new embedding space using your own data. This forces you to use a pre-trained embedding for the sentences. Luckily, there are enough of those nowadays. I believe that the Universal Sentence Encoder (by Google) will suit your needs best.
Once you get vector representations for your sentences, you can go two ways:
Create a matrix of pairwise similarities and visualize it as a heatmap. This representation is useful when you have some prior knowledge about how close the sentences are and you want to check your hypothesis. You can even try it online.
Run t-SNE on the vector representations. This creates a 2D projection of the sentences that preserves the relative distances between them, and it often presents the data much better than PCA. Then you can easily find the neighbours of a given sentence.
You can learn more from this and this
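A sketch of the heatmap option using the Universal Sentence Encoder from TensorFlow Hub (assumes tensorflow_hub is installed; the module URL below is the publicly published USE model):

import numpy as np
import matplotlib.pyplot as plt
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
sentences = ["Hawaii makes a move to boost domestic travel and support local tourism",
             "Honolulu makes a move to boost travel and support local tourism",
             "Hawaii wants tourists to return so much it's offering to pay for half of their travel expenses"]
vectors = np.array(embed(sentences))

# pairwise cosine similarity matrix, shown as a heatmap
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sims = unit @ unit.T
plt.imshow(sims, cmap="viridis")
plt.colorbar()
plt.show()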
Interesting take on the word2vec model. You can use t-SNE to reduce the dimensionality of the vectors to 3 and visualise them using any plotting library such as matplotlib or Dash. I also find this tool helpful when visualising word embeddings: https://projector.tensorflow.org/
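A sketch of that 3-D t-SNE visualisation, assuming model is an already-trained gensim Word2Vec model (note that t-SNE's perplexity must be smaller than the number of points):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = list(model.wv.index_to_key)  # gensim 4.x; older versions: model.wv.index2word
coords = TSNE(n_components=3, perplexity=5).fit_transform(model.wv[words])

ax = plt.figure().add_subplot(projection='3d')
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2])
for word, (x, y, z) in zip(words, coords):
    ax.text(x, y, z, word)
plt.show()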
The idea of learning different word embeddings for words in different contexts is the premise of ELMo (https://allennlp.org/elmo), but you will need a huge training set to train it. Luckily, if your application is not very specific, you can use pre-trained models.

Convert list of words in Text file to Word Vectors

I have a text file with millions of rows that I want to convert into word vectors, so that later on I can compare these vectors with a search keyword and see which texts are closest to it.
My dilemma is that all the training files I have seen for word2vec are in the form of paragraphs, so that each word has some contextual meaning within the file. My file, by contrast, contains an independent keyword in each row.
My question is whether it is possible to create word embeddings from this text file. If not, what is the best approach for finding a matching search keyword in these millions of texts?
My file structure:
Walmart
Home Depot
Home Depot
Sears
Walmart
Sams Club
GreenMile
Walgreen
Expected:
Search text: 'WAL'
Result from my file:
WALGREEN
WALMART
WALMART
Embeddings
Let's step back and understand what word2vec is. Word2vec (like GloVe, FastText, etc.) is a way to represent words as vectors. ML models don't understand words, only numbers, so when we deal with words we want to convert them into numbers (vectors). One-hot encoding is one naive way of encoding words as vectors, but for a large vocabulary the one-hot vectors become too long, and there is no semantic relationship between one-hot encoded words.
With DL came the distributed representation of words (called word embeddings). One important property of these word embeddings is that the vector distance between related words is small compared to the distance between unrelated words, e.g. distance(apple, orange) < distance(apple, cat).
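For example, with pre-trained vectors loaded into a gensim KeyedVectors object (hypothetically named wv here), this property is easy to check:

print(wv.similarity('apple', 'orange'))  # relatively high
print(wv.similarity('apple', 'cat'))     # noticeably lower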
So how are these embedding models trained? They are trained on (very) huge corpora of text. With a huge corpus, the model will learn that apple and orange are used (many times) in the same context, and hence that they are related. So to train a good embedding model you need a huge corpus of text (not independent words, because independent words have no context).
However, one rarely trains a word embedding model from scratch, because good embedding models are available in open source. If your text is domain-specific (say, medical), you can instead do transfer learning on openly available word embeddings.
Out of vocabulary (OOV) words
Word embeddings like word2vec and GloVe cannot return an embedding for OOV words. However, embeddings like FastText (thanks to @gojom for pointing it out) handle OOV words by breaking them into character n-grams and building a vector by summing up the subword vectors that make up the word.
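A sketch of that FastText behaviour in gensim (the toy corpus and parameters are purely illustrative; naming follows gensim 4.x):

from gensim.models import FastText

corpus = [["walmart", "home", "depot", "store"],
          ["walgreen", "pharmacy", "store"]] * 50  # tiny repeated toy corpus
model = FastText(corpus, vector_size=50, window=3, min_count=1, epochs=10)

print("walgreens" in model.wv.key_to_index)  # False: never seen during training
vec = model.wv["walgreens"]  # still gets a vector built from its character n-grams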
Problem
Coming to your problem,
Case 1: Let's say the user enters the word WAL. First of all, it is not a valid English word, so it will not be in the vocabulary, and it is hard to assign a meaningful vector to it. Embeddings like FastText handle this by breaking the word into n-grams. This approach gives good embeddings for misspelled words or slang.
Case 2: Let's say the user enters the word WALL. If you plan to use vector similarity to find the closest word, it will never be close to Walmart, because semantically they are not related. It will rather be close to words like window, paint, and door.
Conclusion
If your search is for semantically similar words, then a solution using vector embeddings will work well. On the other hand, if your search is based on lexical matching, then vector embeddings will be of no help.
If you wanted to find walmart from a fragment like wal, you'd more likely use one of the following (a rough code sketch follows this list):
a substring or prefix search through all entries; or
a reverse-index-of-character-n-grams; or
some sort of edit-distance calculated against all entries or a subset of likely candidates
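A standard-library sketch of those options, using the entries from the file structure above:

import difflib

entries = ["Walmart", "Home Depot", "Home Depot", "Sears", "Walmart",
           "Sams Club", "GreenMile", "Walgreen"]
query = "WAL"

prefix_hits = [e for e in entries if e.upper().startswith(query.upper())]
substring_hits = [e for e in entries if query.upper() in e.upper()]
fuzzy_hits = difflib.get_close_matches(query.lower(),
                                       [e.lower() for e in entries],
                                       n=3, cutoff=0.3)
print(prefix_hits)  # ['Walmart', 'Walmart', 'Walgreen']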
That is, from your example desired output, this is not really a job for word-vectors, even though some algorithms, like FastText, will be able to provide rough vectors for word-fragments based on their overlap with trained words.
If in fact you want to find similar stores, word-vectors might theoretically be useful. But the problem given your example input is that such word-vector algorithms require examples of tokens used in context, from sequences-of-tokens that co-appear in natural-language-like relationships. And you want lots of data featuring varied examples-in-context, to capture subtle gradations of mutual relationships.
While your existing single-column of short entity-names (stores) can't provide that, maybe you have something applicable elsewhere, if you have richer data sources. Some ideas might be:
lists of stores visited by a single customer
lists of stores carrying the same product/UPC
text from a much larger corpus (such as web-crawled text, or maybe Wikipedia) in which there are sufficient in-context usages of each store-name. (You'd just throw out all the other words created from such training - but the vectors for your tokens-of-interest might still be of use in your domain.)

word2vec - what is best? add, concatenate or average word vectors?

I am working on a recurrent language model. To learn word embeddings that can be used to initialize my language model, I am using gensim's word2vec model.
After training, the word2vec model holds two vectors for each word in the vocabulary: the word embedding (rows of input/hidden matrix) and the context embedding (columns of hidden/output matrix).
As outlined in this post there are at least three common ways to combine these two embedding vectors:
summing the context and word vector for each word
summing & averaging
concatenating the context and word vector
However, I couldn't find proper papers or reports on the best strategy. So my questions are:
Is there a common solution whether to sum, average or concatenate the vectors?
Or does the best way depend entirely on the task in question? If so, what strategy is best for a word-level language model?
Why combine the vectors at all? Why not use the "original" word embeddings for each word, i.e. those contained in the weight matrix between the input and hidden neurons?
Related (but unanswered) questions:
word2vec: Summing/concatenate inside and outside vector
why we use input-hidden weight matrix to be the word vectors instead of hidden-output weight matrix?
I have found an answer in the Stanford lecture "Deep Learning for Natural Language Processing" (Lecture 2, March 2016). It's available here. At minute 46, Richard Socher states that the common way is to average the two word vectors.
You should read this research work at least once to get the whole idea of combining word embeddings using different algebraic operators. It was my research.
In this paper you can also see other methods for combining word vectors.
In short, L1-normalized average word vectors and the sum of word vectors are good representations.
I don't know of any work that empirically tests different ways of combining the two vectors, but there is a highly influential paper comparing: 1) using just the word vector, and 2) adding up the word and context vectors. The paper is here: https://www.aclweb.org/anthology/Q15-1016/.
First, note that the metric is analogy and similarity tests, NOT downstream tasks.
Here is a quote from the paper:
for both SGNS and GloVe, it is worthwhile to experiment with the w + c variant [adding up word and context vectors], which is cheap to apply (does not require retraining) and can result in substantial gains (as well as substantial losses).
So I guess you just need to try it out on your specific task.
By the way, here is a post on how to get context vectors from gensim: link
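As a hedged sketch (assuming a gensim Word2Vec model trained with negative sampling, where the output-side "context" vectors are stored in model.syn1neg, row-aligned with the input vectors in model.wv.vectors):

import numpy as np

word_vecs = model.wv.vectors  # input/"word" embeddings
context_vecs = model.syn1neg  # output/"context" embeddings (negative sampling)

summed = word_vecs + context_vecs
averaged = (word_vecs + context_vecs) / 2
concatenated = np.hstack([word_vecs, context_vecs])

idx = model.wv.key_to_index['example_word']  # 'example_word' is a placeholder token
combined_vec = averaged[idx]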
I thought I'd attempt to answer based on the comments.
The question you are linking to is: "WordVectors How to concatenate word vectors to form sentence vector"
Word vectors can be compared on their own. But often one wants to put the sentence, paragraph or document in context - i.e. a collection of words. And then the question arises of how to combine those into a single vector (gensim provides doc2vec for that use case).
That doesn't seem to be applicable in your case, and I would just work with the given word vectors. You can adjust parameters like the size of the embedding, the training data, or other algorithms. You could even combine vectors from different algorithms to create a kind of 'ensemble vector' (e.g. word2vec with GloVe), but it may not be any more effective.
Sometimes in language the same word has a different meaning depending on the type of word within a sentence or a combination of words. e.g. 'game' has a different meaning to 'fair game'. Sense2Vec offers a proposal to generate word vectors for those compound words: https://explosion.ai/blog/sense2vec-with-spacy
(Of course, in that case you already need something that understands the sentence structure, such as SpaCy)

What are the specific steps for computing sentence vectors from word2vec word vectors using the averaging method?

Beginner question, but I am a bit puzzled by this. Hope the answer to this question can benefit other beginners in NLP as well.
Here are some more details:
I know that you can compute sentence vectors from word vectors generated by word2vec, but what are the actual steps involved in making these sentence vectors? Can anyone provide an intuitive example and then some calculations to explain this process?
e.g.: Suppose I have a sentence with three words: "Today is hot", and suppose these words have hypothetical vector values of (1,2,3), (4,5,6), (7,8,9). Do I get the sentence vector by performing component-wise averaging of these word vectors? And what if the vectors are of different lengths, e.g. (1,2), (4,5,6), (7,8,9,23,76) - what does the averaging process look like in those cases?
Creating the vector for a length-of-text (sentence/paragraph/document) by averaging the word-vectors is one simple approach. (It's not great at capturing shades-of-meaning, but it's easy to do.)
Using the gensim library, it can be as simple as:
import numpy as np
from gensim.models.keyedvectors import KeyedVectors

# load the pre-trained Google News word vectors (300 dimensions)
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

text = "the quick brown fox jumped over the lazy dog"
# average the word vectors of all tokens to get one 300-dimensional text vector
text_vector = np.mean([wv[word] for word in text.split()], axis=0)
Using the raw word-vectors, or word-vectors that are either unit-normalized or otherwise weighted by some measure of word significance, are alternatives to consider.
Word-vectors that are compatible with each other will have the same number of dimensions, so there's never an issue of trying to average differently-sized vectors.
Other techniques like 'Paragraph Vectors' (Doc2Vec in gensim) might give better text-vectors for some purposes, on some corpuses.
Other techniques for comparing the similarity of texts that leverage word-vectors, like "Word Mover's Distance" (WMD), might give better pairwise text-similarity scores than comparing single summary vectors. (WMD doesn't reduce a text to a single vector, and can be expensive to calculate.)
For your example, the averaging of the 3 word vectors (each of 3 dimensions) would yield one single vector of 3 dimensions.
Centroid-vec = 1/3*(1+4+7, 2+5+8, 3+6+9) = (4, 5, 6)
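The same component-wise average in numpy:

import numpy as np

word_vecs = np.array([[1, 2, 3],
                      [4, 5, 6],
                      [7, 8, 9]])
sentence_vec = word_vecs.mean(axis=0)  # array([4., 5., 6.])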
A better way to get a single vector for a document is to use paragraph vectors commonly known as doc2vec.

python glove similarity measure calculation

I am trying to understand how python-glove computes most-similar terms.
Is it using cosine similarity?
Example from the python-glove GitHub repo: https://github.com/maciejkula/glove-python/tree/master/glove
I know that from gensim's word2vec, the most_similar method computes similarity using cosine distance.
The project website is a bit unclear on this point:
The Euclidean distance (or cosine similarity) between two word vectors provides an effective method for measuring the linguistic or semantic similarity of the corresponding words.
Euclidean distance is not the same as cosine similarity. It sounds like either works well enough, but it does not specify which is used.
However, we can check the source of the repo you are looking at to see:
dst = (np.dot(self.word_vectors, word_vec)
/ np.linalg.norm(self.word_vectors, axis=1)
/ np.linalg.norm(word_vec))
It uses cosine similarity.
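As a small check with made-up vectors (not from the repo) that the expression above is exactly cosine similarity:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

word_vectors = np.random.rand(10, 50)  # pretend vocabulary of 10 word vectors
word_vec = np.random.rand(50)          # pretend query vector

dst = (np.dot(word_vectors, word_vec)
       / np.linalg.norm(word_vectors, axis=1)
       / np.linalg.norm(word_vec))
assert np.allclose(dst, cosine_similarity(word_vectors, word_vec[None, :]).ravel())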
On the GloVe project website, this is explained with a fair amount of clarity:
http://www-nlp.stanford.edu/projects/glove/
In order to capture in a quantitative way the nuance necessary to distinguish man from woman, it is necessary for a model to associate more than a single number to the word pair. A natural and simple candidate for an enlarged set of discriminative numbers is the vector difference between the two word vectors. GloVe is designed in order that such vector differences capture as much as possible the meaning specified by the juxtaposition of two words.
To read more about the math behind this, check the "Model overview" section in the website
Yes, it uses cosine similarity.
The paper mentions this in the text: "... A similarity score is obtained from the word vectors by first normalizing each feature across the vocabulary and then calculating the cosine similarity. ..."
