I would need to find something like the opposite of model.most_similar()
While most_similar() returns an array of words most similar to the one given as input, I need to find a sort of "center" of a list of words.
Is there a function in gensim or any other tool that could help me?
Example:
Given {'chimichanga', 'taco', 'burrito'}, the center might be mexico or food, depending on the corpus that the model was trained on.
If you supply a list of words as the positive argument to most_similar(), it will report words closest to their mean (which would seem to be one reasonable interpretation of the words' 'center').
For example:
sims = model.most_similar(positive=['chimichanga', 'taco', 'burrito'])
(I somewhat doubt the top result sims[0] here will be either 'mexico' or 'food'; it's most likely to be another mexican-food word. There isn't necessarily a "more generic"/hypernym relation to be found either between word2vec words, or in certain directions... but some other embedding techniques, such as hyperbolic embeddings, might provide that.)
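Equivalently, you can compute the mean vector yourself and query with it. A rough sketch, assuming a recent gensim where the word vectors live under model.wv (note that most_similar() unit-normalizes each input vector before averaging, so results may differ slightly):
import numpy as np

# average the three word vectors and look up the words nearest to that average
mean_vec = np.mean([model.wv['chimichanga'], model.wv['taco'], model.wv['burrito']], axis=0)
sims = model.wv.similar_by_vector(mean_vec, topn=10)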
Related
I am working on a project for building a high-precision word alignment between sentences and their translations in other languages, for measuring translation quality. I am aware of Giza++ and other word alignment tools that are used as part of the pipeline for Statistical Machine Translation, but this is not what I'm looking for. I'm looking for an algorithm that can map words from the source sentence to the corresponding words in the target sentence, transparently and accurately, given these restrictions:
the two languages do not have the same word order, and the order keeps changing
some words in the source sentence do not have corresponding words in the target sentence, and vice versa
sometimes a word in the source corresponds to multiple words in the target, and vice versa, and there can be many-to-many mappings
there can be sentences where the same word is used multiple times, so the alignment needs to be done with the words and their indexes, not only with the words themselves
Here is what I did:
Start with a list of sentence pairs, say English-German, with each sentence tokenized to words
Index all words in each sentence, and create an inverted index for each word (e.g. the word "world" occurred in sentences # 5, 16, 19, 26 ... etc), for both source and target words
Now this inverted index can predict the correlation between any source word and any target word, as the size of the intersection of their sentence sets divided by the size of their union. For example, if the target word "Welt" occurs in sentences 5, 16, 26, 32, the correlation between (world, Welt) is the number of indexes in the intersection (3) divided by the number of indexes in the union (5), hence a correlation of 0.6 (see the sketch after these steps). Using the union gives lower correlations for high-frequency words, such as "the", and the corresponding words in other languages
Iterate over all sentence pairs again, and use the indexes for the source and target words for a given sentence pairs to create a correlation matrix
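Here is a minimal sketch of that correlation computation (the inverted indexes are hypothetical stand-ins, using the "world"/"Welt" example from above):
def correlation(src_index, trg_index, src_word, trg_word):
    # Jaccard-style correlation: |intersection of sentence sets| / |union of sentence sets|
    src_sents = src_index.get(src_word, set())
    trg_sents = trg_index.get(trg_word, set())
    if not src_sents or not trg_sents:
        return 0.0
    return len(src_sents & trg_sents) / len(src_sents | trg_sents)

src_index = {"world": {5, 16, 19, 26}}  # sentence ids in which each source word occurs
trg_index = {"Welt": {5, 16, 26, 32}}   # sentence ids in which each target word occurs
print(correlation(src_index, trg_index, "world", "Welt"))  # 3 / 5 = 0.6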
Here is an example of a correlation matrix between an English and a German sentence, where we can see the challenges discussed above.
In the image, the correlations between the words of the two sentences are shown, and the green cells are the correct alignment points that should be identified by the word-alignment algorithm.
Here is some of what I tried:
It is possible in some cases that the intended alignment is simply the word pair with the highest correlation in its respective column and row, but in many cases it's not.
I have tried things like Dijkstra's algorithm to draw a path connecting the alignment points, but it doesn't seem to work this way, because the word order means you can jump back and forth to earlier words in the sentence, and there is no sensible way to skip words for which there is no alignment.
I think the optimum solution will involve something like expanding rectangles which start from the most likely correspondences, span many-to-many correspondences, and skip words with no alignment, but I'm not exactly sure what would be a good way to implement this.
Here is the code I am using:
import random

src_words = ["I", "know", "this"]
trg_words = ["Ich", "kenne", "das"]

def match_indexes(word1, word2):
    return random.random()  # replace this with the actual correlation value

all_pairs_vals = []  # all (src index, trg index, correlation value) triples
for i in range(len(src_words)):  # iterate over src indexes
    src_word = src_words[i]  # the corresponding src word
    for j in range(len(trg_words)):  # iterate over trg indexes
        trg_word = trg_words[j]  # the corresponding trg word
        val = match_indexes(src_word, trg_word)  # get the matching value from the inverted indexes of each word (or from the data provided in the spreadsheet)
        all_pairs_vals.append((i, j, val))  # add the sentence indexes for src and trg, and the corresponding val

all_pairs_vals.sort(key=lambda x: -x[-1])  # sort in descending order, so the pairs with the highest correlation come first

selected_alignments = []
used_i, used_j = [], []  # keep track of the used source and target indexes
for i0, j0, val0 in all_pairs_vals:
    if i0 in used_i: continue  # if the current source index i0 has been used before, skip this pair
    if j0 in used_j: continue  # same if the current target index was used before
    selected_alignments.append((i0, j0))  # otherwise, add the current pair to the final alignment selection
    used_i.append(i0)  # and mark both indexes as used so they will not be selected again
    used_j.append(j0)

for a in all_pairs_vals:  # list all pairs and indicate which ones were selected
    i0, j0, val0 = a
    if (i0, j0) in selected_alignments:
        print(a, "<<<<")
    else:
        print(a)
It's problematic because it doesn't accommodate many-to-many, or even one-to-many alignments, and it can easily err at the beginning by selecting a wrong pair with the highest correlation, excluding its row and column from future selection. A good algorithm would factor in that a certain pair has the highest correlation in its respective row/column, but would also consider the proximity to other pairs with high correlations.
Here is some data to try if you like, it's in Google sheets:
https://docs.google.com/spreadsheets/d/1-eO47RH6SLwtYxnYygow1mvbqwMWVqSoAhW64aZrubo/edit?usp=sharing
Word alignment remains an open research topic to some extent. The probabilistic models behind Giza++ are fairly non-trivial, see: http://www.ee.columbia.edu/~sfchang/course/svia/papers/brown-machine-translate-93.pdf
There are a lot of existing approaches you could take, such as:
implement the "IBM models" used by Giza++ yourself (or if you're brave, try the NLTK implementation)
implement the (much much simpler) algorithm behind fast_align https://www.aclweb.org/anthology/N13-1073/
implement some form of HMM-based alignment https://www.aclweb.org/anthology/C96-2141/
use deep learning, there are multiple possibilities there; this paper seems to contain a nice overview of approaches https://www.aclweb.org/anthology/P19-1124.pdf (typically people try to leverage the attention mechanism of neural MT models to do this)
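For instance, here is a minimal sketch with NLTK's IBM Model 1 (toy bitext with made-up sentence pairs; real use needs many sentence pairs and more EM iterations, and results on a corpus this small are meaningless):
from nltk.translate import AlignedSent, IBMModel1

# each AlignedSent pairs a tokenized sentence with its tokenized translation
bitext = [
    AlignedSent(['I', 'know', 'this'], ['Ich', 'kenne', 'das']),
    AlignedSent(['I', 'know', 'him'], ['Ich', 'kenne', 'ihn']),
]
IBMModel1(bitext, 5)        # train for 5 EM iterations; alignments are written back onto the bitext
print(bitext[0].alignment)  # word-index pairs such as 0-0 1-1 2-2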
This is a very difficult machine learning problem and while it's not impossible that simple approaches such as yours could work, it might be a good idea to study the existing work first. That being said, we have seen quite a few breakthroughs from surprisingly simple techniques in this field so who knows :-)
I highly recommend testing Awesome-Align. It relies on multilingual BERT (mBERT) and the results look very promising. I even tested it with Arabic, and it did a great job on a difficult alignment example since Arabic is a morphology-rich language, and I believe it would be more challenging than a Latin-based language such as German.
As you can see, one word in Arabic corresponds to multiple words in English, and yet Awesome-Align managed to handle the many-to-many mapping to a great extent. You may give it a try and I believe it will meet your needs.
There is also a Google Colab demo at https://colab.research.google.com/drive/1205ubqebM0OsZa1nRgbGJBtitgHqIVv6?usp=sharing#scrollTo=smW6s5JJflCN
Good luck!
Recently, there were also two papers using bi-/multilingual word/contextual embeddings to do the word alignment. Both of them construct a bipartite graph where the words are weighted with their embedding distances and use graph algorithms to get the alignment.
One paper does a maximum matching between the graph parts. Because the matching is not symmetrical, they do it from both sides and use similar symmetrization heuristics as FastAlign.
The other one, which mentions the alignment only briefly, uses minimum-weight edge cover on the graph and takes that as the alignment.
Both of them claim to be better than FastAlign.
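To illustrate the general idea (this is a rough sketch, not either paper's exact method), here is an alignment via an assignment on embedding similarities; the embeddings are random stand-ins, and SciPy's Hungarian solver gives a strictly one-to-one matching, whereas the papers use maximum matching / minimum-weight edge cover to allow unaligned and many-to-many words:
import numpy as np
from scipy.optimize import linear_sum_assignment

src_vecs = np.random.rand(3, 300)  # stand-ins for bilingual embeddings of ["I", "know", "this"]
trg_vecs = np.random.rand(3, 300)  # stand-ins for embeddings of ["Ich", "kenne", "das"]

# cosine-similarity matrix between every source and target word
sim = src_vecs @ trg_vecs.T
sim /= np.linalg.norm(src_vecs, axis=1, keepdims=True)
sim /= np.linalg.norm(trg_vecs, axis=1)

# maximum-weight one-to-one matching = minimum-cost assignment on the negated similarities
row_ind, col_ind = linear_sum_assignment(-sim)
print(list(zip(row_ind, col_ind)))  # (src index, trg index) alignment pairs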
As the question is specifically addressing Python implementations, and Giza++ and FastAlign still seem to represent SOTA, one might look into
https://pypi.org/project/systran-align/: replicates FastAlign. Seems to be relatively mature. Also note that the original FastAlign code contains a Python wrapper (https://github.com/clab/fast_align/blob/master/src/force_align.py).
https://www.nltk.org/api/nltk.align.html: replicates most GIZA models (a good compromise between performance and quality is IBM4). However, it is rather unclear how thoroughly tested and how well maintained that is, as people generally prefer to work with GIZA++ directly.
Most research code on the topic will nowadays come in Python and be based on embeddings, e.g., https://github.com/cisnlp/simalign, https://github.com/neulab/awesome-align, etc. However, the jury is still out on whether they outperform the older models and if so, for which applications. In the end, you need to go for a compromise between context awareness (reordering!), precision, recall and runtime. Neural models have great potential on being more context aware, statistical models have more predictable behavior.
I have a text file with millions of rows which I want to convert into word vectors, so that later on I can compare these vectors with a search keyword and see which texts are closest to the search keyword.
My dilemma is that all the training files I have seen for word2vec are in the form of paragraphs, so that each word has some contextual meaning within that file. My file, on the other hand, contains independent keywords, one in each row.
My question is whether it is possible to create word embeddings using this text file or not, and if not, what the best approach would be for searching for a matching search keyword in these millions of texts.
**My File Structure:**
Walmart
Home Depot
Home Depot
Sears
Walmart
Sams Club
GreenMile
Walgreen
Expected:
Search text: 'WAL'
Result from my file:
WALGREEN
WALMART
WALMART
Embeddings
Let's step back and understand what word2vec is. Word2vec (like GloVe, FastText, etc.) is a way to represent words as vectors. ML models don't understand words, they only understand numbers, so when we are dealing with words we want to convert them into numbers (vectors). One-hot encoding is one naive way of encoding words as vectors, but for a large vocabulary the one-hot vectors become too long. Also, there is no semantic relationship between one-hot encoded words.
With DL came the distributed representation of words (called word embeddings). One important property of these word embeddings is that the vector distance between related words is small compared to the distance between unrelated words, i.e. distance(apple, orange) < distance(apple, cat).
So how are these embedding models trained? They are trained on (very) huge corpora of text. When you have a huge corpus of text, the model learns that apple and orange are used (many times) in the same context, and so it learns that apple and orange are related. To train a good embedding model you therefore need a huge corpus of text (not independent words, because independent words have no context).
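For illustration, a minimal gensim sketch of the expected training input, i.e. lists of tokens that give each word some context (a toy corpus like this would of course produce meaningless vectors):
from gensim.models import Word2Vec

sentences = [
    ["i", "bought", "apples", "and", "oranges", "at", "the", "market"],
    ["she", "ate", "an", "apple", "and", "an", "orange", "for", "breakfast"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)
print(model.wv.most_similar("apple", topn=3))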
However, one rarely trains a word embedding model from scratch, because good embedding models are available as open source. If your text is domain-specific (say, medical), you can instead do transfer learning on openly available word embeddings.
Out of vocabulary (OOV) words
Word embeddings like word2vec and GloVe cannot return an embedding for OOV words. However, embeddings like FastText (thanks to #gojom for pointing it out) handle OOV words by breaking them into character n-grams and building a vector by summing up the subword vectors that make up the word.
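A minimal gensim FastText sketch of that OOV behaviour (toy corpus, so the actual vectors won't be meaningful):
from gensim.models import FastText

sentences = [["walmart", "store"], ["walgreen", "pharmacy"], ["home", "depot", "store"]]
model = FastText(sentences, vector_size=50, min_count=1, epochs=20)

# "wal" never appears as a token, but FastText assembles a vector from its character n-grams
print(model.wv["wal"])
print(model.wv.most_similar("wal", topn=3))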
Problem
Coming to your problem,
Case 1: let's say the user enters the word WAL. First of all, it is not a valid English word, so it will not be in the vocabulary, and it is hard to find a meaningful vector for it. Embeddings like FastText handle this by breaking the word into n-grams. This approach gives good embeddings for misspelled words or slang.
Case 2: let's say the user enters the word WALL. If you plan to use vector similarity to find the closest word, it will never be close to Walmart, because semantically they are not related. It will rather be close to words like window, paint, door.
Conclusion
If your search is for semantically similar words, then a solution using vector embeddings will work well. On the other hand, if your search is based on lexical form, then vector embeddings will be of no help.
If you wanted to find walmart from a fragment like wal, you'd more likely use something like:
a substring or prefix search through all entries; or
a reverse-index-of-character-n-grams; or
some sort of edit-distance calculated against all entries or a subset of likely candidates
That is, from your example desired output, this is not really a job for word-vectors, even though some algorithms, like FastText, will be able to provide rough vectors for word-fragments based on their overlap with trained words.
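For example, a rough sketch of a simple case-insensitive prefix filter over the entries from the question:
entries = ["Walmart", "Home Depot", "Home Depot", "Sears", "Walmart",
           "Sams Club", "GreenMile", "Walgreen"]

def prefix_search(query, entries):
    # keep every entry that starts with the query, ignoring case
    q = query.upper()
    return [e for e in entries if e.upper().startswith(q)]

print(prefix_search("WAL", entries))  # ['Walmart', 'Walmart', 'Walgreen']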
If in fact you want to find similar stores, word-vectors might theoretically be useful. But the problem given your example input is that such word-vector algorithms require examples of tokens used in context, from sequences-of-tokens that co-appear in natural-language-like relationships. And you want lots of data featuring varied examples-in-context, to capture subtle gradations of mutual relationships.
While your existing single-column of short entity-names (stores) can't provide that, maybe you have something applicable elsewhere, if you have richer data sources. Some ideas might be:
lists of stores visited by a single customer
lists of stores carrying the same product/UPC
text from a much larger corpus (such as web-crawled text, or maybe Wikipedia) in which there are sufficient in-context usages of each store-name. (You'd just throw out all the other words created from such training - but the vectors for your tokens-of-interest might still be of use in your domain.)
Is there a way to find similar docs, like we do with words in word2vec?
Like:
model2.most_similar(positive=['good', 'nice', 'best'],
                    negative=['bad', 'poor'],
                    topn=10)
I know we can use infer_vector and feed the resulting vectors to find similar ones, but I want to feed many positive and negative examples as we do in word2vec.
Is there any way we can do that? Thanks!
The doc-vectors part of a Doc2Vec model works just like word-vectors, with respect to a most_similar() call. You can supply multiple doc-tags or full vectors inside both the positive and negative parameters.
So you could call...
sims = d2v_model.docvecs.most_similar(positive=['doc001', 'doc009'], negative=['doc102'])
...and it should work. The elements of the positive or negative lists could be doc-tags that were present during training, or raw vectors (like those returned by infer_vector(), or your own averages of multiple such vectors).
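For example, a rough sketch mixing an inferred vector with training-time doc-tags (the doc-tags here are the hypothetical ones from above):
# infer a vector for unseen text, then use it alongside known doc-tags
inferred = d2v_model.infer_vector("some new document text".split())
sims = d2v_model.docvecs.most_similar(positive=[inferred, 'doc001'],
                                      negative=['doc102'],
                                      topn=10)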
I don't believe there is a pre-written function for this.
One approach would be to write a function that iterates through each word in the positive list to get top n words for a particular word.
So for positive words in your question example, you would end up with 3 lists of 10 words.
You could then identify words that are common across the 3 lists as the top n similar to your positive list. Since not all words will be common across the 3 lists, you probably need to get top 20 similar words when iterating so you end up with top 10 words as you want in your example.
Then do the same for negative words.
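A rough sketch of that intersection idea for the positive side (common_similar is a hypothetical helper; the per_word and topn values are just the example numbers from above):
def common_similar(model, positive, topn=10, per_word=20):
    # collect the top-`per_word` neighbours of each positive word,
    # then keep only the words common to all of the lists
    neighbour_sets = [
        {w for w, _ in model.most_similar(word, topn=per_word)}
        for word in positive
    ]
    common = set.intersection(*neighbour_sets)
    return list(common)[:topn]

# e.g. common_similar(model2, ['good', 'nice', 'best'])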
Here is an example I have tried:
'Positive' and 'Negative' are not similar words; in fact they are opposites, but spaCy still gives me an 81% similarity score for them.
Here is my code:
import spacy
nlp = spacy.load('en_core_web_lg')
word1 = nlp(u'negative')
word2 = nlp(u'positive')
word1_word2 = word1.similarity(word2)
print(word1_word2)
Typically, word similarities like this are computed using cosine similarity between their corresponding word vectors. Words often used in the same contexts end up in similar locations in the vector space, on the assumption that words that get used similarly mean similar things. E.g., King and Queen might be similar, and King and Man might be similar, but Queen and Man should be a bit less similar (though they still both refer to "people", and they're both nouns, so they'll probably still be more similar than, say, Man and Combusted).
You want these words ('Positive' and 'Negative') to be negatives of each other (cosine similarity of -1), but they're similar because they're almost exactly the same word besides one being the negation of the other. The global semantic vector space incorporates many more ideas than just negation, and so these two words end up being very similar in other ways. What you can do is compute their average vector, then Positive -> average = - (Negative -> average), and that difference vector Positive -> average (or, more precisely, "Positive" - ("Positive" - "Negative") / 2) would approximate the idea of negation that you're particularly interested in. That is, you could then add that vector to other cases to negate them too, e.g. "Yes" + ("Negative" - "Positive") ~= "No"
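Here is a rough sketch of that idea with spaCy's vectors (whether the shifted vector really lands near "no" depends on the vector space; this just illustrates the arithmetic):
import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

pos = nlp.vocab['positive'].vector
neg = nlp.vocab['negative'].vector
negation = neg - pos                    # approximate "negation" direction

yes_vec = nlp.vocab['yes'].vector
approx_no = yes_vec + negation          # "yes" pushed in the negation direction

print(cosine(approx_no, nlp.vocab['no'].vector))  # hopefully higher than...
print(cosine(yes_vec, nlp.vocab['no'].vector))    # ...the plain "yes"/"no" similarity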
All that just to say, the effect you're observing is not a fault of spaCy, and you won't avoid it by using Gensim or Sklearn; it's due to the nature of what "similarity" means in this context. If you want more comprehensible, human-designed semantic relationships between words, consider looking at WordNet, which is manually created and would be more likely to explicitly have some "negation" relation between your two words.
Using the gensim.models.Word2Vec library, you have the possibility to provide a model and a "word" for which you want to find the list of most similar words:
model = gensim.models.Word2Vec.load_word2vec_format(model_file, binary=True)
model.most_similar(positive=[WORD], topn=N)
I wonder if there is a possibility to give the system as input the model and a "vector", and ask the system to return the top similar words (whose vectors are very close to the given vector). Something similar to:
model.most_similar(positive=[VECTOR], topn=N)
I need this functionality for a bilingual setting, in which I have 2 models (English and German), as well as some English words for which I need to find their most similar German candidates.
What I want to do is to get the vector of each English word from the English model:
model_EN = gensim.models.Word2Vec.load_word2vec_format(model_file_EN, binary=True)
vector_w_en=model_EN[WORD_EN]
and then query the German model with these vectors.
model_DE = gensim.models.Word2Vec.load_word2vec_format(model_file_DE, binary=True)
model_DE.most_similar(positive=[vector_w_en], topn=N)
I have implemented this in C using the original distance function in the word2vec package. But now I need it in Python, in order to be able to integrate it with my other scripts.
Do you know if there is already a method in gensim.models.Word2Vec library or other similar libraries which does this? Do I need to implement it by myself?
The method similar_by_vector returns the top-N most similar words by vector:
similar_by_vector(vector, topn=10, restrict_vocab=None)
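For example, reusing the names from the question (depending on your gensim version, these calls may live on model.wv rather than on the model object itself):
vector_w_en = model_EN[WORD_EN]                          # vector of the English word
sims = model_DE.similar_by_vector(vector_w_en, topn=N)   # nearest German words to that vector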
I don't think what you're trying to achieve could ever give an accurate answer, simply because the two models are trained separately. Although both the English and the German model will have similar distances between their respective word vectors, there's no guarantee that the word vector for 'House' will have the same direction as the word vector for 'Haus'.
In simple terms: if you trained both models with vector size = 3, and 'House' has the vector [0.5, 0.2, 0.9], there's no guarantee that 'Haus' will have the vector [0.5, 0.2, 0.9], or even something close to that.
In order to solve this, you could first translate the English word to German and then use the vector for that word to look for similar words in the German model.
TL;DR:
You can't just plug in vectors from one language model into another and expect to have accurate results.