How to obtain antonyms through word2vec? - python

I am currently working on a word2vec model using gensim in Python, and I want to write a function that finds the antonyms and synonyms of a given word.
For example:
antonym("sad")="happy"
synonym("upset")="enraged"
Is there a way to do that in word2vec?

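In word2vec you can find analogies in the following way:
import gensim
# In current gensim releases the word2vec-format loader lives on KeyedVectors rather than Word2Vec.
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.most_similar(positive=['good', 'sad'], negative=['bad'])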
[(u'wonderful', 0.6414928436279297),
(u'happy', 0.6154338121414185),
(u'great', 0.5803680419921875),
(u'nice', 0.5683973431587219),
(u'saddening', 0.5588893294334412),
(u'bittersweet', 0.5544661283493042),
(u'glad', 0.5512036681175232),
(u'fantastic', 0.5471092462539673),
(u'proud', 0.530515193939209),
(u'saddened', 0.5293528437614441)]
Now take some standard antonym pairs such as (good, bad) and (rich, poor), and run the same query once per pair to collect several such lists of nearest antonym candidates. After that, you can use the average of the vectors in these lists.
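A minimal sketch of the seed-pair idea, under loud assumptions: the vectors below are hand-made toy values, not real word2vec output, and plain Python lists stand in for embeddings so the snippet runs without gensim. It also uses a simplified variant of the suggestion: it averages the analogy target (word + positive - negative) over all seed pairs and then ranks the remaining words by cosine similarity. The names `antonym_candidates` and `toy_vectors` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def antonym_candidates(vectors, word, seed_pairs, topn=3):
    """Rank words by similarity to the averaged target word + positive - negative."""
    dim = len(vectors[word])
    target = [0.0] * dim
    for positive, negative in seed_pairs:
        for i in range(dim):
            offset = vectors[positive][i] - vectors[negative][i]
            target[i] += (vectors[word][i] + offset) / len(seed_pairs)
    # Skip the query word and the seed words themselves.
    excluded = {word} | {w for pair in seed_pairs for w in pair}
    scored = [(cand, cosine(target, vec))
              for cand, vec in vectors.items() if cand not in excluded]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# Toy embeddings (assumption): the first dimension encodes polarity.
toy_vectors = {
    "good": [1, 0, 0, 1], "bad": [-1, 0, 0, 1],
    "rich": [1, 0, 1, 0], "poor": [-1, 0, 1, 0],
    "happy": [1, 1, 0, 0], "sad": [-1, 1, 0, 0],
    "money": [0, 0, 1, 0],
}
result = antonym_candidates(toy_vectors, "sad", [("good", "bad"), ("rich", "poor")])
print(result[0][0])
```

With a real model the same idea would use the model's own vectors and a longer list of seed pairs. Note that the direction of each pair matters: (good, bad) moves a negative word towards its positive counterpart, so a positive query word would need the pairs reversed.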

I think it is possible to obtain antonyms using analogies such as
king - man + woman = queen.
Here queen (treated as the antonym of king and a synonym of woman) is the result returned by the trained word2vec model.
Let's say there is a word X with a synonym Y, and Y has an antonym Z. Then we can say that X - Y + Z gives a word that is an antonym of X and a synonym of Z.

Related

Word2Vec/Doc2vec + K-Means clustering - how to extract "semantically meaningful" centroids?

I have a list of unique terms, say for example:
['Dollars','Cash','International Currency','Credit card','Comics','loans','David Beckham','soccer','Iron Man','checks','Euros','World Cup','Marvel Cinematic Universe','Champions league','Superman']
Ultimately I want to achieve the following mapping:
['Dollars','Cash','International Currency','Credit card','loans','checks','Euros','World']: 'Money and finance'
['Comics','Iron Man','Marvel Cinematic Universe','Superman']: 'Comics and Superheroes'
['David Beckham','soccer','World Cup','Champions league']: 'Soccer, Football'
My idea is to use a text embedding like word2vec or doc2vec, and then cluster the embeddings using K-Means. So far this is very straightforward. But then I would like to map the resulting embedding centroids to 2 or 3 relevant terms. Is there a way to go from the numerical embedding centroid to semantically meaningful terms?
If there is a better way to do this other than Embedding > Clustering > Extract meaning from centroid I could try that as well.
A couple of things to note: the terms in my lists are unique (individual words, compound terms, or very short sentences, not paragraphs or documents), so frequency or word-count based methods are not applicable. The lists also contain a lot of noise, e.g. "xxx thx u" and "Hello Mr. Johnson", etc.
So my two asks are:
What is the best way to achieve this mapping?
And how can we map a centroid from an embedding space to a small set of meaningful terms?
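One possible way to attach terms to a centroid, offered only as a sketch and not taken from the question: rank the cluster's own terms by cosine similarity to the centroid and keep the closest two or three. The 2-dimensional vectors below are made-up toy values, and `label_cluster` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def label_cluster(terms, embeddings, topn=2):
    """Return the topn terms of a cluster that lie closest to its centroid."""
    center = centroid([embeddings[term] for term in terms])
    ranked = sorted(terms, key=lambda t: cosine(embeddings[t], center), reverse=True)
    return ranked[:topn]


# Toy embeddings (assumption); a real run would use word2vec/doc2vec vectors.
toy_embeddings = {
    "Dollars": [0.9, 0.1],
    "Cash": [1.0, 0.0],
    "Euros": [0.9, 0.2],
    "loans": [0.5, 0.5],
}
cluster_terms = ["Dollars", "Cash", "Euros", "loans"]
labels = label_cluster(cluster_terms, toy_embeddings)
print(labels)
```

This only ever returns terms that are already in the cluster. Producing a broader label such as 'Money and finance' would need a larger vocabulary to search over, which is outside what this sketch covers.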

Remove duplicate tweets that are 90% similar

I have extracted some tweets and I want to remove duplicates. If I use pandas drop_duplicates(inplace=True), it only removes tweets that are 100% identical.
I want to know whether there is a way to remove tweets that differ only slightly, i.e. are about 90% similar to each other.
Example
When will this year end? There is only misery and bad stuff! I hate 2020!
When will this year end? There are miseries and bad stuff only! I hate 2020!
These tweets are almost identical. How can I remove them?
There's no simple answer to your question, but a couple of naive solutions might look something like the following.
Approach 1
Firstly, you need to define a similarity metric. A common (character based) string comparison metric is Levenshtein distance, but I'd recommend having a look through fuzzywuzzy's README to find one that's appropriate for your use case. For this micro-demo, instead of using the fuzzywuzzy package, I'm using python-levenshtein.
Secondly, compare all strings to all other tweets and compute the string similarities between them. Note that this is totally impractical if you're dealing with a large number of tweets, but let's roll with it. After comparing the strings, you can filter to get the indexes of other strings which are close matches.
Using those indexes, we can create a graph of strings, for which I use the networkx package. This is necessary so we can extract the connected components of the graph, where each connected component represents a network of similar strings. Strictly speaking this is not guaranteed, since for deep graphs a string at one end won't necessarily be all that similar to a string at the other end. But in practice, it turns out to work pretty well.
Setup
import itertools
import random
import Levenshtein
import networkx as nx
import pandas as pd
df = pd.DataFrame({
    "tweet": ["When will this year end? There is only misery and bad stuff! I hate 2020!",
              "When will this year end? There are miseries and bad stuff only! I hate 2020!",
              "I am a tweet with no obvious duplicates",
              "Tweeeeeet!",
              "Tweeet",
              "Tweet tweet!"]
})
Logic
def compare(tweet1, threshold=0.7):
    # compare tweets using Levenshtein distance (or whatever string comparison metric)
    matches = df['tweet'].apply(lambda tweet2: (Levenshtein.ratio(tweet1, tweet2) >= threshold))
    # get positive matches
    matches = matches[matches].index.tolist()
    # convert to list of tuples
    return [*zip(iter(matches[:-1]), iter(matches[1:]))]
# create graph objects
nodes = df.index.tolist()
edges = [*itertools.chain(*df["tweet"].apply(compare))]
# create graphs
G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)
# get connected component indexes
grouped_indexes = [*nx.connected_components(G)]
# get a random choice index from each group
filtered_indexes = [random.choice([*_]) for _ in grouped_indexes]
df.loc[filtered_indexes]
Output
A filtered subset of the original tweets DataFrame.
tweet
0 When will this year end? There is only misery ...
2 I am a tweet with no obvious duplicates
5 Tweet tweet!
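For comparison, here is a dependency-free variant of the same idea. It is a sketch that swaps in `difflib.SequenceMatcher` from the standard library for python-levenshtein, and a simple greedy pass for the graph step: a tweet is kept only if it is not too similar to any tweet already kept. The function name `drop_near_duplicates` and the 0.8 default threshold are assumptions to tune for your data.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(texts, threshold=0.8):
    """Keep a text only if its similarity to every kept text is below threshold."""
    kept = []
    for text in texts:
        if all(SequenceMatcher(None, text, other).ratio() < threshold for other in kept):
            kept.append(text)
    return kept


tweets = [
    "When will this year end? There is only misery and bad stuff! I hate 2020!",
    "When will this year end? There are miseries and bad stuff only! I hate 2020!",
    "I am a tweet with no obvious duplicates",
]
unique_tweets = drop_near_duplicates(tweets)
print(unique_tweets)
```

Like the graph version, this compares every tweet against many others, so it is only practical for small collections. Unlike the graph version it always keeps the first tweet of each group rather than a random one.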
Approach 2
We can use an unsupervised learning algorithm to cluster strings together, something like k-means. This is the bread and butter of unsupervised algorithms, and it has the drawback that you have to know the optimal number of clusters in advance, or more typically work it out via testing. But it has the great advantage that if you're adding more tweets to your dataset, you can quickly apply your clustering model and find similar tweets.
There's a million and one tutorials on how to do this, but the basic approach here would be to (1) clean your text, (2) convert your text into a TFIDF, (3) compute a similarity metric (cosine similarity is common) between each document pair, (4) then train your k-means (or similar) model.
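The steps above could be sketched roughly as follows, assuming scikit-learn is installed. The fourth tweet, the choice of two clusters, and the English stop-word list are assumptions made for illustration. The explicit pairwise-similarity step is skipped here: k-means is run directly on the TF-IDF rows, which scikit-learn L2-normalises by default, so Euclidean distance between rows behaves much like cosine distance.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "When will this year end? There is only misery and bad stuff! I hate 2020!",
    "When will this year end? There are miseries and bad stuff only! I hate 2020!",
    "I am a tweet with no obvious duplicates",
    "Another tweet with no obvious duplicates at all",  # hypothetical extra tweet
]

# Steps (1) and (2): light cleaning (lowercasing, stop-word removal) and TF-IDF.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(tweets)

# Step (4): fit k-means; the number of clusters has to be chosen up front.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(tfidf)

# New tweets can be assigned to an existing cluster without refitting.
new_labels = kmeans.predict(vectorizer.transform(["When will this year end? I hate 2020!"]))
print(labels, new_labels)
```

Tweets that share a label are candidates for being near-duplicates. A cluster can still contain tweets that are merely on the same topic, so a final string-similarity check within each cluster is a reasonable follow-up.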
If you're interested in this approach, here's a few random tutorials to follow I found after a quick Google.
https://towardsdatascience.com/k-means-clustering-8e1e64c1561c
http://brandonrose.org/clustering
https://iq.opengenus.org/document-clustering-nlp-kmeans/
...
Hope this helps!
You can use cosine similarity between two tweets:
from nltk.tokenize import word_tokenize  # needs NLTK's tokenizer data to be downloaded
X = tweet1
Y = tweet2
# tokenization (sets, so each distinct token counts once)
X_set = set(word_tokenize(X))
Y_set = set(word_tokenize(Y))
l1 = []; l2 = []
# form a set containing the tokens of both strings
rvector = X_set.union(Y_set)
for w in rvector:
    # build one binary vector per tweet
    l1.append(1 if w in X_set else 0)
    l2.append(1 if w in Y_set else 0)
c = 0
# cosine formula
for i in range(len(rvector)):
    c += l1[i] * l2[i]
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity: ", cosine)
if cosine >= 0.90:
    print("Similar")

Best match for input query from a set of documents

I have 8 documents and I ran TF-IDF on them to get an array. I don't understand how to find out which document is the best match for a given input query.
all_documents = [doc1, doc2, ...., doc7]
sklearn_tfidf = TfidfVectorizer(norm='l2',min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=True, tokenizer=tokenize)
sklearn_representation = sklearn_tfidf.fit_transform(all_documents).toarray()
Transform the input to tf-idf format using the same fitted TfidfVectorizer (its transform method, not fit_transform). You can then use a distance metric (cosine, euclidean, manhattan, ...) to find the document that is closest to your input.
Each of the documents should use the same vocabulary. I assume that your 8 document vectors have the same length? The sklearn_tfidf object that you created has an attribute vocabulary_ that contains all words that are used in the vectors. Your input query should be reduced to only containing those words.
Example
Document1: dogs are cute
Document2: cats are awful
This leads to a vocabulary of [dogs, cats, are, cute, awful]. Query words other than these 5 cannot be used. For example, if your query is cute animals, the word animals carries no meaning because it does not occur in any of the documents. The query thus reduces to the following vector: [0,0,0,1,0], since cute is the only word that can be found in the documents.
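The example can be reproduced with a small dependency-free sketch. It uses plain term counts instead of TF-IDF weights to keep the arithmetic visible, which is a simplification and not what TfidfVectorizer computes. The helper names `to_vector` and `cosine` are hypothetical.

```python
import math

documents = {"Document1": "dogs are cute", "Document2": "cats are awful"}
# Vocabulary in the same order as in the example above.
vocabulary = ["dogs", "cats", "are", "cute", "awful"]


def to_vector(text):
    """Count vocabulary terms in text; words outside the vocabulary are dropped."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


query_vector = to_vector("cute animals")
scores = {name: cosine(query_vector, to_vector(text)) for name, text in documents.items()}
best_match = max(scores, key=scores.get)
print(query_vector, best_match)
```

With the real vectorizer the equivalent steps would be to call transform on the query with the already fitted sklearn_tfidf object and compare the result against sklearn_representation.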

Python - tf-idf predict a new document similarity

Inspired by this answer, I'm trying to find the cosine similarity between a trained tf-idf vectorizer and a new document, and return the similar documents.
The code below finds the cosine similarity of the first vector, not of a new query:
>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1. , 0.04405952, 0.11016969, ..., 0.04433602,
0.04457106, 0.03293218])
Since my train data is huge, looping through the entire trained vectorizer sounds like a bad idea.
How can I infer the vector of a new document, and find the related docs, same as the code below?
>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([ 0, 958, 10576, 3277])
>>> cosine_similarities[related_docs_indices]
array([ 1. , 0.54967926, 0.32902194, 0.2825788 ])
This problem can be partially addressed by combining the vector space model (tf-idf and cosine similarity) with the boolean model. These are concepts from information retrieval, and they are used (and nicely explained) in ElasticSearch, a pretty good search engine.
The idea is simple: you store your documents as an inverted index. This is comparable to the index at the end of a book, which lists each word together with the pages (documents) it is mentioned on.
Instead of calculating the similarity against the tf-idf vectors of all documents, you only calculate it for documents that have at least one word (or some threshold number of words) in common with the query. This can be done by looping over the words in the queried document, using the inverted index to find the documents that also contain each word, and calculating the similarity for those.
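A minimal sketch of the inverted-index filtering step, using only the standard library. The function names and the `min_shared` parameter are hypothetical, and whitespace splitting stands in for real tokenization. The sketch only selects candidate documents; the tf-idf cosine similarity would then be computed for those candidates alone.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index


def candidate_documents(index, query, min_shared=1):
    """Return ids of documents sharing at least min_shared distinct words with query."""
    shared_counts = defaultdict(int)
    for word in set(query.lower().split()):
        for doc_id in index.get(word, ()):
            shared_counts[doc_id] += 1
    return sorted(doc_id for doc_id, shared in shared_counts.items() if shared >= min_shared)


docs = ["dogs are cute", "cats are awful", "stock prices fell"]
index = build_inverted_index(docs)
candidates = candidate_documents(index, "cute cats")
print(candidates)
```

Very common words such as "are" appear in almost every document and defeat the filter, so in practice they are usually removed as stop words or the `min_shared` threshold is raised.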
You should take a look at gensim. Example starting code looks like this:
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
corpus = [dictionary.doc2bow(line.lower().split()) for line in open('corpus.txt')]
tfidf = models.TfidfModel(corpus)
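# num_features must equal the vocabulary size, so derive it from the dictionary instead of hard-coding it
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))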
At prediction time you first get the vector for the new doc:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_tfidf = tfidf[vec_bow]
Then get the similarities (sorted by most similar):
sims = index[vec_tfidf] # perform a similarity query against the corpus
sims = sorted(enumerate(sims), key=lambda item: -item[1]) # sort by similarity, most similar first
print(sims) # print (document_number, document_similarity) 2-tuples
This does a linear scan like you wanted to do but they have a more optimized implementation. If the speed is not enough then you can look into approximate similarity search (Annoy, Falconn, NMSLIB).
For huge data sets, there is a solution called text clustering by concept. Search engines use this technique.
In the first step, you cluster your documents into groups (e.g. 50 clusters); each cluster then has a representative document (one that contains words carrying useful information about its cluster).
In the second step, to calculate the cosine similarity between the new document and your data set, you loop through all representatives (50 of them) and find the nearest ones (e.g. the top 2 representatives).
In the final step, you loop through all documents in the selected clusters and find the one with the highest cosine similarity.
With this technique, you can reduce the number of loops and improve performance.
You can read about more techniques in some chapters of this book: http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
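The three steps can be sketched as a two-stage search. This is a toy illustration under assumptions: the clusters and their representative vectors are given by hand rather than produced by a clustering algorithm, the vectors are 2-dimensional made-up values, and `two_stage_search` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def two_stage_search(query, clusters, top_clusters=2):
    """clusters is a list of (representative_vector, [(doc_id, vector), ...])."""
    # Stage 1: rank clusters by how close their representative is to the query.
    ranked = sorted(clusters, key=lambda c: cosine(query, c[0]), reverse=True)
    # Stage 2: scan only the documents inside the best clusters.
    best_id, best_score = None, -1.0
    for _, members in ranked[:top_clusters]:
        for doc_id, vector in members:
            score = cosine(query, vector)
            if score > best_score:
                best_id, best_score = doc_id, score
    return best_id, best_score


toy_clusters = [
    ([1.0, 0.0], [("a", [1.0, 0.1]), ("b", [0.9, 0.2])]),
    ([0.0, 1.0], [("c", [0.1, 1.0]), ("d", [0.2, 0.8])]),
    ([-1.0, 0.0], [("e", [-1.0, 0.1])]),
]
best_id, best_score = two_stage_search([0.2, 1.0], toy_clusters, top_clusters=1)
print(best_id)
```

The speed-up comes at the cost of exactness: if the true nearest document sits in a cluster whose representative is not among the top ones, it is missed. Raising `top_clusters` trades speed for recall.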

Matching words and vectors in gensim Word2Vec model

I have had the gensim Word2Vec implementation compute some word embeddings for me. Everything went quite fantastically as far as I can tell; now I am clustering the word vectors created, hoping to get some semantic groupings.
As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. I.e. if I have the vector of embeddings [x, y, z], I would like to find out which actual word this vector represents. I can get the words/Vocab items by calling model.vocab and the word vectors through model.syn0. But I could not find a location where these are explicitly matched.
This was more complicated than I expected and I feel I might be missing the obvious way of doing it. Any help is appreciated!
Problem:
Match words to embedding vectors created by Word2Vec () -- how do I do it?
My approach:
After creating the model (code below*), I would now like to match the indexes assigned to each word (during the build_vocab() phase) to the vector matrix outputted as model.syn0.
Thus
for i in range(0, newmod.syn0.shape[0]):  # iterate over all words in model
    print(i)
    word = [k for k in newmod.vocab if newmod.vocab[k].__dict__['index'] == i]  # get the word out of the internal dictionary by its index
    wordvector = newmod.syn0[i]  # get the vector with the corresponding index
    print(wordvector == newmod[word])  # testing: compare result of looking up the word in the model -- this prints True
Is there a better way of doing this, e.g. by feeding the vector into the model to match the word?
Does this even get me correct results?
*My code to create the word vectors:
model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)
model.build_vocab(sentencefeeder(folderlist)) #sentencefeeder puts out sentences as lists of strings
model.save("newmodel")
I found this question which is similar but has not really been answered.
I have been searching for a long time to find the mapping between the syn0 matrix and the vocabulary. Here is the answer: use model.index2word, which is simply the list of words in the right order! (In gensim 4 and later the same list is available as model.wv.index_to_key.)
This is not in the official documentation (why?), but it can be found directly in the source code: https://github.com/RaRe-Technologies/gensim/blob/3b9bb59dac0d55a1cd6ca8f984cead38b9cb0860/gensim/models/word2vec.py#L441
If all you want to do is map a word to a vector, you can simply use the [] operator, e.g. model["hello"] will give you the vector corresponding to hello.
If you need to recover a word from a vector you could loop through your list of vectors and check for a match, as you propose. However, this is inefficient and not pythonic. A convenient solution is to use the similar_by_vector method of the word2vec model, like this:
import gensim
documents = [['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]
model = gensim.models.Word2Vec(documents, min_count=1)
print(model.wv.similar_by_vector(model.wv["survey"], topn=1))
which outputs:
[('survey', 1.0000001192092896)]
where the number represents the similarity.
However, this method is still inefficient, as it still has to scan all of the word vectors to search for the most similar one. The best solution to your problem is to find a way to keep track of your vectors during the clustering process so you don't have to rely on expensive reverse mappings.
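One way to keep track of the words during clustering, sketched with the standard library only: keep the word list in the same order as the rows of the vector matrix, so the i-th cluster label belongs to the i-th word. The labels below are hypothetical stand-ins for what a clustering algorithm would return for rows in that order, and `group_words_by_cluster` is a hypothetical helper.

```python
from collections import defaultdict


def group_words_by_cluster(words, labels):
    """Group words by cluster label; words[i] must correspond to labels[i]."""
    clusters = defaultdict(list)
    for word, label in zip(words, labels):
        clusters[label].append(word)
    return dict(clusters)


# words is in the same order as the rows of the matrix that was clustered.
words = ["human", "computer", "graph", "trees"]
labels = [0, 0, 1, 1]  # hypothetical clustering output, one label per row
clusters = group_words_by_cluster(words, labels)
print(clusters)
```

Because the ordering carries the mapping, no vector ever has to be looked up again after clustering.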
So I found an easy way to do this, where nmodel is the name of your model.
# zip the two lists containing words and vectors; list() allows iterating over the pairs more than once in Python 3
zipped = list(zip(nmodel.wv.index2word, nmodel.wv.syn0))
# the resulting list contains `(word, wordvector)` tuples. We can extract the entry for any `word` or `vector` (replace with the word/vector you're looking for) using a list comprehension:
wordresult = [i for i in zipped if i[0] == word]
# vectors are NumPy arrays, so compare element-wise and require every element to match
vecresult = [i for i in zipped if (i[1] == vector).all()]
This is based on the gensim code. For older versions of gensim, you might need to drop the wv after the model.
As @bpachev mentioned, gensim does have an option of searching by vector, namely similar_by_vector.
It however implements a brute force linear search, i.e. computes cosine similarity between given vector and vectors of all words in vocabulary, and gives off the top neighbours. An alternate option, as mentioned in the other answer is to use an approximate nearest neighbour search algorithm like FLANN.
Sharing a gist demonstrating the same:
https://gist.github.com/kampta/139f710ca91ed5fabaf9e6616d2c762b
