Word2Vec/Doc2vec + K-Means clustering - how to extract "semantically meaningful" centroids?

Word2Vec/Doc2vec + K-Means clustering - how to extract "semantically meaningful" centroids? - python

I have a list of unique terms, say for example:
['Dollars','Cash','International Currency','Credit card','Comics','loans','David Beckham','soccer','Iron Man','checks','Euros','World Cup','Marvel Cinematic Universe','Champions league','Superman']
Ultimately I want to achieve the following mapping:
['Dollars','Cash','International Currency','Credit card','loans','checks','Euros','World']: 'Money and finance'
['Comics','Iron Man','Marvel Cinematic Universe','Superman']: 'Comics and Superherores'
['David Beckham','soccer','World Cup','Champions league']: 'Soccer, Football'
My idea is to use a text embedding like word2vec or doc2vec, and then cluster the embeddings using K-Means. So far this is very straightforward. But then I would like to map the resulting embedding centroids to 2 or 3 relevant terms. Is there a way to go from the numerical embedding centroid to semantically meaningful terms?
If there is a better way to do this other than Embedding > Clustering > Extract meaning from centroid I could try that as well.
A couple of things to note about this is that: The terms in my lists are unique - either individual words, compound terms, or vert short sentences, not paragraphs or documents - so frequency or word count based methods are not applicable. And the lists also contain a lot of noise, e.g. "xxx thx u" and "Hello Mr. Johnson", etc...
So my two asks are:
What is the best way to achieve this mapping?
And how can we map a centroid from an embedding space to a small set of meaningful terms?

Related

Remove duplicate tweets that are 90% similar

I have extracted the tweets and i want to remove duplicate tweets. If I use padas drop_duplicates(inplace=True) it will remove 100% duplicated tweets.
I want to know is there a way to remove that are slightly different from each other and 90% similar to each other.
Example
When will this year end? There is only misery and bad stuff! I hate 2020!
When will this year end? There are miseries and bad stuff only! I hate 2020!
These tweets are almost similar, how can I remove them

There's no simple answer to your question, but a couple of naive solutions might look something like the following.
Approach 1
Firstly, you need to define a similarity metric. A common (character based) string comparison metric is Levenshtein distance, but I'd recommend having a look through fuzzywuzzy's README to find one that's appropriate for your use case. For this micro-demo, instead of using †he fuzzywuzzy package, I'm using python-levenshtein.
Secondly, compare all strings to all other tweets and compute the string similarities between them. Note that this is totally impractical if you're dealing with a large number of tweets, but let's roll with it. After comparing the strings, you can filter to get the indexes of other strings which are close matches.
Using those indexes, we can create a graph of strings, for which I use the networkx package. This is necessary so we can extract the connect components of the graph, for which each connected component represents a network of similar strings. This is not necessarily true, since for deep graphs a string at one end won't necessarily be all that similar to a string at the other end. But in practice, it turns out to work pretty well.
Setup
import networkx as nx
import Levenshtein
import random
df = pd.DataFrame({
"tweet":["When will this year end? There is only misery and bad stuff! I hate 2020!",
"When will this year end? There are miseries and bad stuff only! I hate 2020!",
"I am a tweet with no obvious duplicates",
"Tweeeeeet!",
"Tweeet",
"Tweet tweet!"]
})
Logic
def compare(tweet1, threshold=0.7):
# compare tweets using Levenshtein distance (or whatever string comparison metric)
matches = df['tweet'].apply(lambda tweet2: (Levenshtein.ratio(tweet1, tweet2) >= threshold))
# get positive matches
matches = matches[matches].index.tolist()
# convert to list of tuples
return [*zip(iter(matches[:-1]), iter(matches[1:]))]
# create graph objects
nodes = df.index.tolist()
edges = [*itertools.chain(*df["tweet"].apply(compare))]
# create graphs
G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)
# get connected component indexes
grouped_indexes = [*nx.connected_components(G)]
# get a random choice index from each group
filtered_indexes = [random.choice([*_]) for _ in grouped_indexes]
df.loc[filtered_indexes]
Output
A filtered subset of the original tweets DataFrame.
tweet
0 When will this year end? There is only misery ...
2 I am a tweet with no obvious duplicates
5 Tweet tweet!
Approach 2
We can use an unsupervised learning algorithm to cluster strings together, something like k-means This is your bread and butter of unsupervised algorithms, and it has the drawback that you have to know the optimal number of clusters in advance, or more typically work it out via testing. But it has the great advantage that if you're adding more tweets to your dataset, you can quickly apply your clustering model and find similar tweets.
There's a million and one tutorials on how to do this, but the basic approach here would be to (1) clean your text, (2) convert your text into a TFIDF, (3) compute a similarity metric (cosine similarity is common) between each document pair, (4) then train your k-means (or similar) model.
If you're interested in this approach, here's a few random tutorials to follow I found after a quick Google.
https://towardsdatascience.com/k-means-clustering-8e1e64c1561c
http://brandonrose.org/clustering
https://iq.opengenus.org/document-clustering-nlp-kmeans/
...
Hope this helps!

You can use Cosine Similarity between two tweets:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
X =tweet1
Y =tweet2
# tokenization
X_set = word_tokenize(X)
Y_set = word_tokenize(Y)
l1 =[];l2 =[]
# form a set containing keywords of both strings
rvector = X_set.union(Y_set)
for w in vector:
if w in X_set: l1.append(1) # create a vector
else: l1.append(0)
if w in Y_set: l2.append(1)
else: l2.append(0)
c = 0
# cosine formula
for i in range(len(rvector)):
c+= l1[i]*l2[i]
cosine = c / float((sum(l1)*sum(l2))**0.5)
print("similarity: ", cosine)
if cosine>=0.90:
print("Similar")

How to build a content-based recommender system that uses multiple attributes?

I want to build a content-based recommender system in Python that uses multiple attributes to decide whether two items are similar. In my case, the "items" are packages hosted by the C# package manager (example) that have various attributes such as name, description, tags that could help to identify similar packages.
I have a prototype recommender system here that currently uses only a single attribute, the description, to decide whether packages are similar. It computes TF-IDF rankings for the descriptions and prints out the top 10 recommendations based on that:
# Code mostly stolen from http://blog.untrod.com/2016/06/simple-similar-products-recommendation-engine-in-python.html
def train(dataframe):
tfidf = TfidfVectorizer(analyzer='word',
ngram_range=(1, 3),
min_df=0,
stop_words='english')
tfidf_matrix = tfidf.fit_transform(dataframe['description'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
for idx, row in dataframe.iterrows():
similar_indices = cosine_similarities[idx].argsort()[:-10:-1]
similar_items = [(dataframe['id'][i], cosine_similarities[idx][i])
for i in similar_indices]
id = row['id']
similar_items = [it for it in similar_items if it[0] != id]
# This 'sum' is turns a list of tuples into a single tuple:
# [(1,2), (3,4)] -> (1,2,3,4)
flattened = sum(similar_items, ())
try_print("Top 10 recommendations for %s: %s" % (id, flattened))
How can I combine cosine_similarities with other similarity measures (based on same author, similar names, shared tags, etc.) to give more context to my recommendations?

For some context, my work with content-based recommenders has revolved primarily around raw text and categorical data/features. Here's a high-level approach I've taken that has worked out nicely and is pretty simple to implement.
Suppose I have three feature columns that I can potentially use to make recommendations: description, name, and tags. To me, the path of least resistance entails combining these three feature sets in a useful way.
You're off to a good start, using TF-IDF to encode description. So why not treat name and tags in a similar way by creating a feature "corpus" consisting of description, name, and tags? Literally, this would mean concatenating the contents of each of the three columns into one long text column.
Be wise about the concatenation, though, as it's probably to your advantage to preserve from which column a given word comes from, in the case of features like name and tag, which are assumed to have much lower cardinality than description. To put it more explicitly: instead of just creating your corpus column like this:
df['corpus'] = (pd.Series(df[['description', 'name', 'tags']]
.fillna('')
.values.tolist()
).str.join(' ')
You might try preserving information about where particular data points in name and tags come from. Something like this:
df['name_feature'] = ['name_{}'.format(x) for x in df['name']]
df['tags_feature'] = ['tags_{}'.format(x) for x in df['tags']]
And after you do that, I would take things a step further by considering how the default tokenizer (which you're using above) works in TfidfVectorizer. Suppose you have the name of a given package's author: "Johnny 'Lightning' Thundersmith". If you just concatenate that literal string, the tokenizer will split it up and roll each of "Johnny", "Lightning", and "Thundersmith" into separate features, which could potentially diminish the information added by that row's value for name. I think it's best to try to preserve that information. So I would do something like this to each of your lower-cardinality text columns (e.g. name or tags):
def raw_text_to_feature(s, sep=' ', join_sep='x', to_include=string.ascii_lowercase):
def filter_word(word):
return ''.join([c for c in word if c in to_include])
return join_sep.join([filter_word(word) for word in text.split(sep)])
def['name_feature'] = df['name'].apply(raw_text_to_feature)
The same sort of critical thinking should be applied to tags. If you've got a comma-separated "list" of tags, you'll probably have to parse those individually and figure out the right way to use them.
Ultimately, once you've got all of your <x>_feature columns created, then you can create your final "corpus" and plug that into your recommender system as inputs.
This whole system takes some engineering, to be sure, but I've found it's the easiest way to introduce new information from other columns that have different cardinalities.

As I understand your question, there are two ways this can be done:
Combine the other features with tfidf_matrix and then calculate the cosine similarity
Calculate the similarity of other features using other methods and then somehow combine them with the cosine similarity of tfidf_matrix to get a meaningful metric.
I was talking about the first one.
For example lets say, for your data, the tfidf_matrix (for only the 'description' column) is of shape (3000, 4000)
where 3000 are the rows in the data and 4000 are the unique words (vocabulary) found by the TfidfVectorizer.
Now lets say you do some feature processing on the other columns ('authors', 'id' etc) and that produces 5 columns. So the shape of that data is (3000, 5).
I was saying to combine the two matrices (combine the columns) so that the new shape of your data is (3000, 4005) and then calculate the cosine_similarity.
See below example:
from scipy import sparse
# This is your original matrix
tfidf_matrix = tfidf.fit_transform(dataframe['description'])
# This is the other features
other_matrix = some_processing_on_other_columns()
combined_matrix = sparse.hstack((tfidf_matrix, other_matrix))
cosine_similarities = linear_kernel(combined_matrix, combined_matrix)

You have a vector for a user $\gamma_u$ and an item $\gamma_i$. The scoring function for your recommendation is:
Right now you said your feature vector has only 1 item, but once you get more, this model will scale for that.
In this case you already engineered your vectors, but typically in recommenders, the feature are learned through matrix factorization. This is called a latent factor model, whereas you have a hand-crafted model.

Best match for input query from a set of documents

I have 8 documents and I ran TF-IDF on it to get an array. I don't understand how I find out which is the best document match for a given input query?
all_documents = [doc1, doc2, ...., doc7]
sklearn_tfidf = TfidfVectorizer(norm='l2',min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=True, tokenizer=tokenize)
sklearn_representation = sklearn_tfidf.fit_transform(all_documents).toarray()

Transform the input to tf-idf format using TfidfVectorizer. You can then use a distance metric (cosine, euclidean, manhattan, ...) to calculate the document that is closest to your input.
Each of the documents should use the same vocabulary. I assume that your 8 document vectors have the same length? The sklearn_tfidf object that you created has an attribute vocabulary_ that contains all words that are used in the vectors. Your input query should be reduced to only containing those words.
Example
Document1: dogs are cute
Document2: cats are awful
Leads to a vocabulary of [dogs, cats, are, cute, awful]. A query containing other words than these 5 cannot be used. For example if your query is cute animals, the animals has no meaning, because it cannot be found in one of the documents. The query thus reduces to following vector: [0,0,0,1,0] since cute is the only word that can be found in the documents.

Python - tf-idf predict a new document similarity

Inspired by this answer, I'm trying to find cosine similarity between a trained trained tf-idf vectorizer and a new document, and return the similar documents.
The code below finds the cosine similarity of the first vector and not a new query
>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1. , 0.04405952, 0.11016969, ..., 0.04433602,
0.04457106, 0.03293218])
Since my train data is huge, looping through the entire trained vectorizer sounds like a bad idea.
How can I infer the vector of a new document, and find the related docs, same as the code below?
>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([ 0, 958, 10576, 3277])
>>> cosine_similarities[related_docs_indices]
array([ 1. , 0.54967926, 0.32902194, 0.2825788 ])

This problem can be partially addressed by combining the vector space model (which is the tf-idf & cosine similarity) together with the boolean model. These are concepts of information theory and they are used (and nicely explained) in ElasticSearch- a pretty good search engine.
The idea is simple: you store your documents as inverted indices. Which is comparable to the words present at the end of a book, which hold a reference to the pages (documents) they were mentioned in.
Instead of calculating the tf-idf vector for all document it will only calculate it for documents that have at least one (or specify a threshold) of the words in common. This can be simply done by looping over the words in the queried document, finding the documents that also have this word using the inverted index and calculate the similarity for those.

You should take a look at gensim. Example starting code looks like this:
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
corpus = [dictionary.doc2bow(line.lower().split()) for line in open('corpus.txt')]
tfidf = models.TfidfModel(corpus)
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)
At prediction time you first get the vector for the new doc:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_tfidf = tfidf[vec_bow]
Then get the similarities (sorted by most similar):
sims = index[vec_tfidf] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples
This does a linear scan like you wanted to do but they have a more optimized implementation. If the speed is not enough then you can look into approximate similarity search (Annoy, Falconn, NMSLIB).

For huge data sets, there is a solution called Text Clustering By Concept. search engines use this Technic,
At first step, you cluster your documents to some groups(e.g 50 cluster), then each cluster has a representative document(that contain some words that have some useful information about it's cluster)
At second step, for calculating cosine similarity between New Document and you data set, you loop through all representative(50 numbers) and find top near representatives(e.g 2 representative)
At final step, you can loop through all documents in selected representative and find nearest cosine similarity
With this Technic, you can reduce the number of loops and improve performace,
You can read more tecnincs in some chapter of this book: http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html

Matching words and vectors in gensim Word2Vec model

I have had the gensim Word2Vec implementation compute some word embeddings for me. Everything went quite fantastically as far as I can tell; now I am clustering the word vectors created, hoping to get some semantic groupings.
As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. I.e. if I have the vector of embeddings [x, y, z], I would like to find out which actual word this vector represents. I can get the words/Vocab items by calling model.vocab and the word vectors through model.syn0. But I could not find a location where these are explicitly matched.
This was more complicated than I expected and I feel I might be missing the obvious way of doing it. Any help is appreciated!
Problem:
Match words to embedding vectors created by Word2Vec () -- how do I do it?
My approach:
After creating the model (code below*), I would now like to match the indexes assigned to each word (during the build_vocab() phase) to the vector matrix outputted as model.syn0.
Thus
for i in range (0, newmod.syn0.shape[0]): #iterate over all words in model
print i
word= [k for k in newmod.vocab if newmod.vocab[k].__dict__['index']==i] #get the word out of the internal dicationary by its index
wordvector= newmod.syn0[i] #get the vector with the corresponding index
print wordvector == newmod[word] #testing: compare result of looking up the word in the model -- this prints True
Is there a better way of doing this, e.g. by feeding the vector into the model to match the word?
Does this even get me correct results?
*My code to create the word vectors:
model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)
model.build_vocab(sentencefeeder(folderlist)) #sentencefeeder puts out sentences as lists of strings
model.save("newmodel")
I found this question which is similar but has not really been answered.

I have been searching for a long time to find the mapping between the syn0 matrix and the vocabulary... here is the answer : use model.index2word which is simply the list of words in the right order !
This is not in the official documentation (why ?) but it can be found directly inside the source code : https://github.com/RaRe-Technologies/gensim/blob/3b9bb59dac0d55a1cd6ca8f984cead38b9cb0860/gensim/models/word2vec.py#L441

If all you want to do is map a word to a vector, you can simply use the [] operator, e.g. model["hello"] will give you the vector corresponding to hello.
If you need to recover a word from a vector you could loop through your list of vectors and check for a match, as you propose. However, this is inefficient and not pythonic. A convenient solution is to use the similar_by_vector method of the word2vec model, like this:
import gensim
documents = [['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]
model = gensim.models.Word2Vec(documents, min_count=1)
print model.similar_by_vector(model["survey"], topn=1)
which outputs:
[('survey', 1.0000001192092896)]
where the number represents the similarity.
However, this method is still inefficient, as it still has to scan all of the word vectors to search for the most similar one. The best solution to your problem is to find a way to keep track of your vectors during the clustering process so you don't have to rely on expensive reverse mappings.

So I found an easy way to do this, where nmodel is the name of your model.
#zip the two lists containing vectors and words
zipped = zip(nmodel.wv.index2word, nmodel.wv.syn0)
#the resulting list contains `(word, wordvector)` tuples. We can extract the entry for any `word` or `vector` (replace with the word/vector you're looking for) using a list comprehension:
wordresult = [i for i in zipped if i[0] == word]
vecresult = [i for i in zipped if i[1] == vector]
This is based on the gensim code. For older versions of gensim, you might need to drop the wv after the model.

As #bpachev mentioned, gensim does have an option of searching by vector, namely similar_by_vector.
It however implements a brute force linear search, i.e. computes cosine similarity between given vector and vectors of all words in vocabulary, and gives off the top neighbours. An alternate option, as mentioned in the other answer is to use an approximate nearest neighbour search algorithm like FLANN.
Sharing a gist demonstrating the same:
https://gist.github.com/kampta/139f710ca91ed5fabaf9e6616d2c762b

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.