I found the code in this post to be very helpful. (I would add a comment to that post but I need 50 reputation points.)
I used the same code as in the post above, but added a test document I have been using for debugging my own clustering code. For some reason a word that occurs in only one document appears as a top term in both clusters.
The code is:
Update: I added "Unique sentence" to the documents below.
documents = ["I ran yesterday.",
"The sun was hot.",
"I ran yesterday in the hot and humid sun.",
"Yesterday the sun was hot.",
"Yesterday I ran in the hot sun.",
"Unique sentence." ]
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
#cluster documents
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
#print top terms per cluster clusters
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
print ("Cluster %d:" % i,)
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind])
print
The output I receive is:
UPDATE: I updated the output below to reflect the "unique sentence" above.
Cluster 0:
sun
hot
yesterday
ran
humid
unique
sentence
Cluster 1:
unique
sentence
yesterday
sun
ran
humid
hot
You'll note that "humid" appears as a top term in both clusters even though it's in just 1 line of the documents above. I would expect a unique word, like "humid" in this case, to be a top term in just 1 of the clusters.
Thanks!
TF*IDF tells you the representativeness of a word (in this case the column) for a specific document (and in this case the row). By representative I mean: a word occurs frequently in one document but not frequently in other documents. The higher the TF*IDF value, the more this word represents a specific document.
Now let's start to understand the values you actually work with. From sklearn's KMeans you use the attribute cluster_centers_. This gives you the coordinates of each cluster center: an array of TF*IDF weights, one per word. It is important to note that these are just an abstract form of word frequency and no longer relate back to a specific document. Next, numpy.argsort() gives you the indices that would sort an array, starting with the index of the lowest TF*IDF value, so you reverse its order with [:, ::-1]. Now the indices of the most representative words for that cluster center are at the beginning.
Now, let's talk a bit more about k-means. k-means initialises its k cluster centers randomly. Each document is then assigned to the nearest center, and the cluster centers are recomputed. This is repeated until the optimization criterion, minimizing the sum of squared distances between documents and their closest center, is met. What this means for you is that each cluster dimension most likely doesn't have a TF*IDF value of 0, because of the random initialisation. Furthermore, k-means stops as soon as the optimization criterion is met. Thus, the TF*IDF values of a center merely tell you that the documents assigned to that cluster are closer to this center than to the other cluster centers.
One additional bit: with order_centroids[i, :10], the 10 most representative words for each cluster are printed, but since your vocabulary contains only 7 terms in total, all of them are printed for both clusters either way, just in a different order.
I hope this helped. By the way, k-means is not guaranteed to find the global optimum and might get stuck in a local optimum; that's why it is usually run multiple times with different random starting points.
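To make those two points concrete, here is a minimal sketch built on the question's variables (model, terms, X, true_k); n_init=10 is just an example value:

import numpy as np

# Each row of cluster_centers_ is one centroid: one TF*IDF weight per vocabulary term.
# argsort()[::-1] turns those weights into term indices, highest weight first.
weights = model.cluster_centers_[0]
top_idx = np.argsort(weights)[::-1][:10]
for ind in top_idx:
    print(terms[ind], weights[ind])   # a 0.0 weight means no document in this cluster uses the term

# Re-running with several random initialisations keeps the best of n_init runs
# (lowest inertia), which mitigates the local-optimum issue mentioned above.
better_model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=10)
better_model.fit(X)
print(better_model.inertia_)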
Not necessarily. The code you are using creates a vector space from the bag of words (excluding stop words) of your corpus (I am ignoring the tf-idf weighting). Looking at your documents, your vector space has size 5, with a word array like the following (ignoring order):
word_vec_space = [yesterday, ran, sun, hot, humid]
Each document is assigned a numeric vector indicating whether it contains the words in word_vec_space.
"I ran yesterday." -> [1,1,0,0,0]
"The sun was hot." -> [0,0,1,1,0]
...
When performing k-means clustering, you pick k starting points in the vector space and allow the points to move around to optimize the clusters. You ended up with both cluster centroids containing a non-zero value for 'humid'. This is because the one sentence that contains 'humid' also contains 'sun', 'hot', and 'yesterday'.
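A quick way to see this is to compute a centroid by hand as the mean of its members' vectors. The binary vectors below are the ones from this answer; the cluster assignment is only an assumed example:

import numpy as np

# word_vec_space = [yesterday, ran, sun, hot, humid]
doc_ran     = np.array([1, 1, 0, 0, 0])  # "I ran yesterday."
doc_humid   = np.array([1, 1, 1, 1, 1])  # "I ran yesterday in the hot and humid sun."
doc_sun_hot = np.array([1, 0, 1, 1, 0])  # "Yesterday the sun was hot."

# Suppose k-means puts these three documents into one cluster (assumed assignment).
centroid = np.mean([doc_ran, doc_humid, doc_sun_hot], axis=0)
print(centroid)  # every word that occurs in *any* member, including 'humid', gets a non-zero weight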
Why would clusters have distinct top terms?
Assuming the clustering worked (very often it doesn't - beware), would you consider these clusters to be bad or good:
banana fruit
apple fruit
apple computer
windows computer
window blinds
If I ever got such clusters, I would be happy (in fact, I would suspect an error, because these are much too good results; text clustering is always borderline to non-working).
With text clusters, it's a lot about word combinations, not just single words. Apple fruit and apple computer are not the same.
Related
I have a corpus of 250k Dutch news articles 2010-2020 to which I've applied word2vec models to uncover relationships between sets of neutral words and dimensions (e.g. good-bad). Since my aim is also to analyze the prevalence of certain topics over time, I was thinking of using doc2vec instead so as to simultaneously learn word and document embeddings. The 'prevalence' of topics in a document could then be calculated as the cosine similarities between doc vectors and word embeddings (or combinations of word vectors). In this way, I can calculate the annual topical prevalence in the corpus and see whether there's any changes over time. An example of such an approach can be found here.
My issue is that the avg. yearly cosine similarities yield really strange results. As an example, the cosine similarities between document vectors and a mixture of keywords related to covid-19/coronavirus show a decrease in topical prevalence since 2016 (which obviously cannot be the case).
My question is whether the approach I'm following is actually valid, or whether there's something I'm missing. Should 250k documents and a 100k+ vocabulary be sufficient?
Below is the code that I've written:
# Doc2Vec model
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(doc, [i]) for i, doc in enumerate(tokenized_docs)]
d2vmodel = Doc2Vec(docs, min_count=5, vector_size=200, window=10, dm=1)
docvecs = d2vmodel.docvecs
wordvecs = d2vmodel.wv

# normalize vector
from numpy.linalg import norm

def nrm(x):
    return x / norm(x)

# topical prevalence per doc
def topicalprevalence(topic, docvecs, wordvecs):
    proj_lst = []
    for i in range(0, len(docvecs)):
        topic_lst = []
        for j in topic:
            cossim = nrm(docvecs[i]) @ nrm(wordvecs[j])  # dot product of unit vectors = cosine similarity
            topic_lst.append(cossim)
        topic_avg = sum(topic_lst) / len(topic_lst)
        proj_lst.append(topic_avg)
    topicsyrs = {
        'topic': proj_lst,
        'year': df['datetime'].dt.year
    }
    return pd.DataFrame(topicsyrs)

# avg topic prevalence per year
def avgtopicyear(topic, docvecs, wordvecs):
    docs = topicalprevalence(topic, docvecs, wordvecs)
    return pd.DataFrame(docs.groupby("year")["topic"].mean())

# run
covid = ['corona', 'coronapandemie', 'coronacrisis', 'covid', 'pandemie']
covid_scores = topicalprevalence(covid, docvecs, wordvecs)
The word-vec-to-doc-vec relationships in modes that train both are interesting, but a bit hard to characterize as to what they really mean. In a sense the CBOW-like mode of dm=1 (PV-DM) mixes the doc-vector in as one equal word among the whole window when training to predict the 'target' word. But in the skip-gram-mixed mode dm=0, dbow_words=1, there will be window-count context-word-vec-to-target-word training cycles for every one doc-vec-to-target-word cycle, changing the relative weight.
So if you saw a big improvement with dm=0, dbow_words=1, it might also be because that made the model relatively more word-to-word trained. Varying window is another way to change that balance, as is increasing epochs in plain dm=1 mode – which should also result in doc/word-compatible training, though perhaps not at the same rate/balance.
Whether a single topicalprevalence() mean for a full year would actually be reflective of individual word occurrences for a major topic may or may not be a valid conjecture, depending on possible other changes in the training data. Something like a difference in the relative mix of other major categories in the corpus might swamp even a giant new news topic. (E.g., what if in 2020 some new section or subsidiary with a different focus, like entertainment, launched? It might swamp the effects of other words, especially when compressing down to a single vector of some particular dimensionality.)
Something like a clustering of the year's articles, and identification of the closest 1 or N clusters to the target words, with their similarities, might be more reflective even if the population of articles is changing. Or, a plot of each year's full set of articles as a histogram of similarities to the target words – which might show a 'lump' of individual articles (not losing their distinctiveness to a full-year average) developing closer to the new phenomenon over time.
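A rough sketch of that histogram idea, reusing nrm, docvecs, wordvecs and df from the question's code (the topic words are taken from the question's covid list):

import numpy as np
import matplotlib.pyplot as plt

topic = ['corona', 'covid']
topic_vec = nrm(np.mean([wordvecs[w] for w in topic], axis=0))

# cosine similarity of every article to the topic direction
sims = np.array([nrm(docvecs[i]) @ topic_vec for i in range(len(docvecs))])
years = df['datetime'].dt.year.to_numpy()

# one histogram of similarities per year: a growing right-hand 'lump' would
# indicate a subset of articles moving closer to the new phenomenon
for year in sorted(set(years)):
    plt.hist(sims[years == year], bins=50, histtype='step', label=str(year))
plt.legend()
plt.show()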
Turns out that setting parameters to dm=0, dbow_words=1 allows for training documents and words in the same space, now yielding valid results.
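For reference, a minimal sketch of that configuration; apart from dm and dbow_words, the parameter values are simply carried over from the question's code:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(doc, [i]) for i, doc in enumerate(tokenized_docs)]

# dm=0 selects PV-DBOW; dbow_words=1 additionally trains skip-gram word vectors
# in the same space as the doc vectors, so doc-vec / word-vec cosines are comparable.
d2vmodel = Doc2Vec(docs, dm=0, dbow_words=1,
                   min_count=5, vector_size=200, window=10)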
This is from a text analysis exercise using data from Rotten Tomatoes. The data is in critics.csv, imported as a pandas DataFrame, "critics".
This piece of the exercise is to
Construct the cumulative distribution of document frequencies (df).
The x-axis is a document count (x_i) and the y-axis is the percentage of words that appear less than x_i times. For example, at x = 5, plot a point representing the percentage or number of words that appear in 5 or fewer documents.
From a previous exercise, I have a "Bag of Words"
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
# build the vocabulary and transform to a "bag of words"
X = vectorizer.fit_transform(critics.quote)
# Convert matrix to Compressed Sparse Column (CSC) format
X = X.tocsc()
Every example I've found calculates a count of documents per word from that "bag of words" matrix in this way:
docs_per_word = X.sum(axis=0)
I buy that this works; I've looked at the result.
But I'm confused about what's actually happening and why it works, what is being summed, and how I might have been able to figure out how to do this without needing to look up what other people did.
I figured this out last night. It doesn't actually work; I misinterpreted the result. (I thought it was working because Jupyter notebooks only show a few values in a large array. But, examined more closely, the array values were too big. The max value in the array was larger than the number of 'documents'!)
X (my "bag of words) is a word frequency vector. Summing over X provides information on how often each word occurs within the corpus of documents. But the instructions as for how many documents a word appears in (e.g. between 0 and 4 for four documents), not how many times it appears in the set of those documents (0 - n for four documents).
I need to convert X to a boolean matrix. (Now I just have to figure out how to do this. ;-)
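For completeness, here is one way to do that last step (a sketch, not necessarily the exercise's intended solution): compare the sparse matrix against zero to get booleans, then sum the columns.

import numpy as np

# X is the CountVectorizer output in CSC format, shape (n_documents, n_words)
docs_per_word = (X > 0).sum(axis=0)          # counts each (document, word) pair once: the document frequency
docs_per_word = np.asarray(docs_per_word).ravel()

# cumulative distribution: fraction of words appearing in <= x documents
counts = np.bincount(docs_per_word)
cumulative = np.cumsum(counts) / docs_per_word.size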
I have a question about word2vec and word embeddings. I have downloaded the GloVe pre-trained word embeddings (glove.6B.100d: 400,000 words x 100 dimensions) and am using this function to read them:
import numpy as np

def loadGloveModel(gloveFile):
    print("Loading Glove Model")
    model = {}
    with open(gloveFile, 'r', encoding='utf8') as f:
        for line in f:
            splitLine = line.split()
            word = splitLine[0]
            embedding = np.array([float(val) for val in splitLine[1:]])
            model[word] = embedding
    print("Done.", len(model), " words loaded!")
    return model
Now if I call this function for the word 'python', like this:
print(loadGloveModel('glove.6B.100d.txt')['python'])
it gives me a 100-dimensional vector like this:
[ 0.24934 0.68318 -0.044711 -1.3842 -0.0073079 0.651
-0.33958 -0.19785 -0.33925 0.26691 -0.033062 0.15915
0.89547 0.53999 -0.55817 0.46245 0.36722 0.1889
0.83189 0.81421 -0.11835 -0.53463 0.24158 -0.038864
1.1907 0.79353 -0.12308 0.6642 -0.77619 -0.45713
-1.054 -0.20557 -0.13296 0.12239 0.88458 1.024
0.32288 0.82105 -0.069367 0.024211 -0.51418 0.8727
0.25759 0.91526 -0.64221 0.041159 -0.60208 0.54631
0.66076 0.19796 -1.1393 0.79514 0.45966 -0.18463
-0.64131 -0.24929 -0.40194 -0.50786 0.80579 0.53365
0.52732 0.39247 -0.29884 0.009585 0.99953 -0.061279
0.71936 0.32901 -0.052772 0.67135 -0.80251 -0.25789
0.49615 0.48081 -0.68403 -0.012239 0.048201 0.29461
0.20614 0.33556 -0.64167 -0.64708 0.13377 -0.12574
-0.46382 1.3878 0.95636 -0.067869 -0.0017411 0.52965
0.45668 0.61041 -0.11514 0.42627 0.17342 -0.7995
-0.24502 -0.60886 -0.38469 -0.4797 ]
I need help in understanding the output. What do these values represent, and what is their significance?
In usual word2vec/GloVe, the individual per-dimension coordinates don't specifically mean anything. The training process instead forces words into valuable/interesting relative positions against each other.
All meaning is in the relative distances and relative directions, not specifically aligned with exact coordinate axes.
Consider a classic illustrative example: the ability of word-vectors to solve an analogy like "man is to king as woman is to ?" – by finding the word queen near some expected point in the coordinate space.
There will be neighborhoods of the word-vector space that include lots of related words of one type (man, men, male, boy, etc. - or king, queen, prince, royal, etc.). And further, there may be some directions that match well with human ideas of categories and magnitude (more woman-like, more-monarchical, higher-ranked, etc.). But these neighborhoods and directions generally are not 1:1 correlated with exact axis-dimensions of the space.
And further, there are many possible near rotations/reflections/transformations of a space full of word-vectors which are just-as-good as each other for typical applications, but totally different in their exact coordinates for each word. That is, all the expected relative distances are similar – words have the 'right' neighbors, in the right ranked order – and there are useful directional patterns. But the individual words in each have no globally 'right' or consistent position – just relatively useful positions.
Even if in one set of vectors there appears to be some vague correlation – like "high values in dimension 21 correlate with the idea of 'maleness'" – it's likely to be a coincidence of that vector set, not a reliable relationship.
(There are some alternate techniques which try to force the individual dimensions to be mapped to more-interpretable concepts – see as one example NNSE – but their use seems less common.)
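As a concrete illustration of the 'relative directions' point, the classic analogy query looks roughly like this in gensim; the file name is hypothetical and assumes the GloVe vectors have been converted to word2vec format first:

from gensim.models import KeyedVectors

# assumes the GloVe file was converted with gensim's glove2word2vec script (or loaded via gensim's downloader)
wv = KeyedVectors.load_word2vec_format('glove.6B.100d.word2vec.txt')

# "man is to king as woman is to ?" -- answered by vector arithmetic on relative directions,
# not by reading off any single coordinate axis
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))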
Here is a nice article explaining the underlying intuition and meaning of word2vec vectors.
https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
There's no universal way to know exactly what an embedding means. The results discussed in that article were discovered by looking at many embeddings where one value is varied and noting the differences. Each word2vec model will come up with its own unique embedding. The individual values of the embedding have some semantic meaning in the language.
What word2vec gives you is a conversion from a sparse one-hot vector, representing each word in a dictionary of potentially millions of words, into a small dense vector where each value has some semantic meaning in the language. Large sparse inputs are usually bad for learning; small, dense, meaningful inputs are usually good.
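A toy illustration of that size difference (the numbers are made up, purely to show the shapes involved):

import numpy as np

vocab_size = 1_000_000          # hypothetical dictionary size
one_hot = np.zeros(vocab_size)  # sparse one-hot input: a single 1, everything else 0
one_hot[42] = 1.0               # index of some word in the dictionary

dense = np.random.randn(100)    # a 100-dimensional dense embedding of the same word
print(one_hot.shape, dense.shape)  # (1000000,) vs (100,)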
In a nutshell, a word vector in a word embedding represents the word's contexts. It "embeds" the meaning because similar words have similar contexts. Furthermore, you can extend this idea to "whatever embedding": just train a neural network on a lot of context of something (sentences, paragraphs, documents, images and so on), and the resulting vector of d dimensions will contain a valuable representation of your objects.
This is a good post to get a complete landscape
https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
I have multiple documents that contain multiple sentences. I want to use doc2vec to cluster (e.g. k-means) the sentence vectors by using sklearn.
As such, the idea is that similar sentences are grouped together in several clusters. However, it is not clear to me whether I have to train every single document separately and then use a clustering algorithm on the sentence vectors, or whether I could infer a sentence vector from doc2vec without training on every new sentence.
Right now this is a snippet of my code:
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

sentenceLabeled = []
for sentenceID, sentence in enumerate(example_sentences):
    sentenceL = TaggedDocument(words=sentence.split(), tags=['SENT_%s' % sentenceID])
    sentenceLabeled.append(sentenceL)

model = Doc2Vec(size=300, window=10, min_count=0, workers=11, alpha=0.025,
                min_alpha=0.025)
model.build_vocab(sentenceLabeled)
for epoch in range(20):
    model.train(sentenceLabeled)
    model.alpha -= 0.002            # decrease the learning rate
    model.min_alpha = model.alpha   # fix the learning rate, no decay

textVect = model.docvecs.doctag_syn0

## K-means ##
num_clusters = 3
km = KMeans(n_clusters=num_clusters)
km.fit(textVect)
clusters = km.labels_.tolist()

## Print Sentence Clusters ##
cluster_info = {'sentence': example_sentences, 'cluster': clusters}
sentenceDF = pd.DataFrame(cluster_info, index=[clusters], columns=['sentence', 'cluster'])

for num in range(num_clusters):
    print()
    print("Sentence cluster %d: " % int(num + 1), end='')
    print()
    for sentence in sentenceDF.ix[num]['sentence'].values.tolist():
        print(' %s ' % sentence, end='')
    print()
    print()
Basically, what I am doing right now is training on every labeled sentence in the document. However, I have the idea that this could be done in a simpler way.
Eventually, the sentences that contain similar words should be clustered together and printed. At this point, training every document separately does not clearly reveal any logic within the clusters.
Hopefully someone can steer me in the right direction.
Thanks.
Have you looked at the word vectors you get (use the dm=1 algorithm setting)? Do these show good similarities when you inspect them?
I would try using t-SNE to reduce your dimensions once you have some reasonable-looking similar word vectors working. You can use PCA first to reduce to, say, 50 or so dimensions if you need to; I think both are in sklearn. Then see if your documents are forming distinct groups or not.
Also look at your most_similar() document vectors and try infer_vector() on a known trained sentence; you should get a similarity very close to 1 if all is well. (infer_vector() gives a slightly different result each time, so it's never identical!)
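A rough sketch of those two checks, assuming the model, textVect and clusters variables from the question's code; the parameter choices are only examples:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA down to ~50 dims, then t-SNE to 2-D for plotting
reduced = PCA(n_components=50).fit_transform(textVect)
coords = TSNE(n_components=2).fit_transform(reduced)
plt.scatter(coords[:, 0], coords[:, 1], c=clusters)
plt.show()

# sanity check: re-inferring a trained sentence should land very near its trained vector
new_vec = model.infer_vector(example_sentences[0].split())
print(model.docvecs.most_similar([new_vec], topn=3))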
For your use case, all sentences in all documents should be trained together. Essentially, you should treat each sentence as a mini-document. Then they all share the same vocabulary and semantic space.
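A minimal sketch of that setup; documents here is an assumed input, a list of documents where each document is a list of sentence strings:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# documents: list of documents, each a list of sentence strings (assumed input format)
tagged = []
for doc_id, doc in enumerate(documents):
    for sent_id, sentence in enumerate(doc):
        tag = 'DOC_%d_SENT_%d' % (doc_id, sent_id)
        tagged.append(TaggedDocument(words=sentence.split(), tags=[tag]))

# one model over all sentences from all documents: shared vocabulary and vector space
model = Doc2Vec(tagged, vector_size=300, window=10, min_count=2, epochs=20)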
I am using scikit-learn's KMeans algorithm to cluster comments.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentence_list = ['hello how are you', "I am doing great", "my name is abc"]

vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)

num_clusters = 2  # example value; not defined in the original snippet
km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)
When I print the output of vectorized, it shows the indices of the words and the tf-idf scores at those indices.
So I'm wondering: given that we only get the tf-idf scores of words, how do we manage to cluster documents based on individual words and not the score of an entire document? Or maybe it does do this. Can someone explain the concept behind this to me?
You should take a look at how the KMeans algorithm works. First, the stop words never make it into vectorized, so they are completely ignored by KMeans and have no influence on how the documents are clustered. Now suppose you have:
sentence_list=["word1", "word2", "word2 word3"]
Let's say you want 2 clusters. In this case you expect the second and the third document to be in the same cluster because they share a common word. Let's see how this happens.
The numeric representation of the documents in vectorized looks like this (columns are the alphabetically ordered vocabulary):
word1     word2     word3
1.000000  0.000000  0.000000   # doc 1
0.000000  1.000000  0.000000   # doc 2
0.000000  0.605349  0.795961   # doc 3
In the first step of KMeans, some centroids are chosen randomly from the data; for example, say document 1 and document 3 are the initial centroids:
Centroid 1: [1, 0.000000, 0.000000]
Centroid 2: [0, 0.605349, 0.795961]
Now if you calculate the distances from every point(document) to each one of the two centroids, you will see that:
document 1 has distance 0 to centroid 1 so it belongs to centroid 1
document 3 has distance 0 to centroid 2 so it belongs to centroid 2
Finally we calculate the distance between the remaining document 2 and each centroid to find out which one it belongs to:
>>> from scipy.spatial.distance import euclidean
>>> euclidean([0, 1, 0], [1, 0, 0]) # dist(doc2, centroid1)
1.4142135623730951
>>> euclidean([0, 1, 0], [0, 0.605349, 0.795961]) # dist(doc2, centroid2)
0.8884272507056005
So the second document and the second centroid are closer, which means that the second document is assigned to the second centroid.
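For completeness, a small sketch that reproduces these numbers with sklearn's default TfidfVectorizer settings:

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import euclidean

docs = ["word1", "word2", "word2 word3"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

print(vec.get_feature_names_out())   # ['word1' 'word2' 'word3']  (get_feature_names() on older sklearn)
print(X.round(6))                    # the three document vectors shown above

# distance of doc 2 to each hand-picked centroid (docs 1 and 3)
print(euclidean(X[1], X[0]))   # ~1.414
print(euclidean(X[1], X[2]))   # ~0.888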
TF/IDF is a measure that calculates the importance of a word in a document with respect to the rest of the words in that document. It does not compute the importance of a standalone word (and that makes sense, right? Because importance always means privilege over others!). So the TF/IDF of each word is actually an importance measure of a document with respect to that word.
I don't see where TF/IDF is used in your code. However, it is possible to run the k-means algorithm with TF/IDF scores used as features. Also, clustering the three sample documents that you have mentioned is simply impossible, since no two documents there have a common word!
Edit 1: First of all, if the word 'cat' occurs in two documents, it is possible that they would be clustered together (depending on the other words in the two documents and also the other documents). Secondly, you should learn more about k-means: it uses features to cluster documents together, and each tf/idf score for each word in a document is a feature measure used to compare that document against the others in the corpus.