This is from a text analysis exercise using data from Rotten Tomatoes. The data is in critics.csv, imported as a pandas DataFrame, "critics".
This piece of the exercise is to
Construct the cumulative distribution of document frequencies (df).
The x-axis is a document count (x_i) and the y-axis is the percentage of words that appear in fewer than (x_i) documents. For example, at x = 5, plot a point representing the percentage or number of words that appear in 5 or fewer documents.
From a previous exercise, I have a "Bag of Words"
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
# build the vocabulary and transform to a "bag of words"
X = vectorizer.fit_transform(critics.quote)
# Convert matrix to Compressed Sparse Column (CSC) format
X = X.tocsc()
Every example I've found calculates documents-per-word from that "bag of words" matrix in this way:
docs_per_word = X.sum(axis=0)
I buy that this works; I've looked at the result.
But I'm confused about what's actually happening and why it works: what exactly is being summed, and how could I have figured this out myself without needing to look up what other people did?
I figured this out last night. It doesn't actually work; I misinterpreted the result. (I thought it was working because Jupyter notebooks only show a few values in a large array. But, examined more closely, the array values were too big. The max value in the array was larger than the number of 'documents'!)
X (my "bag of words) is a word frequency vector. Summing over X provides information on how often each word occurs within the corpus of documents. But the instructions as for how many documents a word appears in (e.g. between 0 and 4 for four documents), not how many times it appears in the set of those documents (0 - n for four documents).
I need to convert X to a boolean matrix. (Now I just have to figure out how to do this. ;-)
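For reference, here is a minimal sketch of that conversion and the cumulative plot, assuming X and critics exist as above (the X > 0 comparison and the variable names below are my own illustration, not from the exercise):
import numpy as np
import matplotlib.pyplot as plt
# (X > 0) gives a boolean matrix: True wherever a word occurs in a document at all.
# Summing that over axis=0 counts documents containing each word, not occurrences.
docs_per_word = (X > 0).sum(axis=0)           # shape (1, n_words), values 0..n_documents
doc_freqs = np.asarray(docs_per_word).ravel()
# cumulative distribution: fraction of words appearing in x or fewer documents
xs = np.arange(0, doc_freqs.max() + 1)
cdf = [(doc_freqs <= x).mean() for x in xs]
plt.plot(xs, cdf)
plt.xlabel('number of documents a word appears in')
plt.ylabel('fraction of words')
plt.show()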
I have a corpus of 250k Dutch news articles 2010-2020 to which I've applied word2vec models to uncover relationships between sets of neutral words and dimensions (e.g. good-bad). Since my aim is also to analyze the prevalence of certain topics over time, I was thinking of using doc2vec instead so as to simultaneously learn word and document embeddings. The 'prevalence' of topics in a document could then be calculated as the cosine similarities between doc vectors and word embeddings (or combinations of word vectors). In this way, I can calculate the annual topical prevalence in the corpus and see whether there's any changes over time. An example of such an approach can be found here.
My issue is that the avg. yearly cosine similarities yield really strange results. As an example, the cosine similarities between document vectors and a mixture of keywords related to covid-19/coronavirus show a decrease in topical prevalence since 2016 (which obviously cannot be the case).
My question is whether the approach I'm following is actually valid, or whether there's something I'm missing. Shouldn't 250k documents and a 100k+ vocabulary be sufficient?
Below is the code that I've written:
# Doc2Vec model
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import pandas as pd  # used below to build the results DataFrame
docs = [TaggedDocument(doc, [i]) for i, doc in enumerate(tokenized_docs)]
d2vmodel = Doc2Vec(docs, min_count = 5, vector_size = 200, window = 10, dm = 1)
docvecs = d2vmodel.docvecs
wordvecs = d2vmodel.wv
# normalize vector
from numpy.linalg import norm
def nrm(x):
    return x / norm(x)
# topical prevalence per doc
def topicalprevalence(topic, docvecs, wordvecs):
    proj_lst = []
    for i in range(0, len(docvecs)):
        topic_lst = []
        for j in topic:
            # cosine similarity between the normalized doc vector and word vector
            cossim = nrm(docvecs[i]) @ nrm(wordvecs[j])
            topic_lst.append(cossim)
        topic_avg = sum(topic_lst) / len(topic_lst)
        proj_lst.append(topic_avg)
    topicsyrs = {
        'topic': proj_lst,
        'year': df['datetime'].dt.year
    }
    return pd.DataFrame(topicsyrs)
# avg topic prevalence per year
def avgtopicyear(topic, docvecs, wordvecs):
    docs = topicalprevalence(topic, docvecs, wordvecs)
    return pd.DataFrame(docs.groupby("year")["topic"].mean())
# run
covid = ['corona', 'coronapandemie', 'coronacrisis', 'covid', 'pandemie']
covid_scores = topicalprevalence(covid, docvecs, wordvecs)
The word-vec-to-doc-vec relationships in modes that train both are interesting, but a bit hard to characterize as to what they really mean. In a sense the CBOW-like mode of dm=1 (PV-DM) mixes the doc-vector in as one equal word among the whole window when training to predict the 'target' word. But in the skip-gram-mixed mode dm=0, dbow_words=1, there will be window-count context-word-vec-to-target-word pair cycles for every one doc-vec-to-target-word pair cycle, changing the relative weight.
So if you saw a big improvement in dm=0, dbow_words=1, it might also be because that made the model relatively more word-to-word trained. Varying window is another way to change that balance, or to increase epochs, in plain dm=1 mode, which should also result in doc/word-compatible training, though perhaps not at the same rate/balance.
Whether a single topicalprevalence() mean vector for a full year would actually be reflective of individual word occurrences for a major topic may or may not be a valid conjecture, depending on possible other changes in the training data. Something like a difference in the relative mix of other major categories in the corpus might swamp even a giant new news topic. (E.g.: what if in 2020 some new section or subsidiary with a different focus, like entertainment, launched? It might swamp the effects of other words, especially when compressing down to a single vector of some particular dimensionality.)
Something like a clustering of the year's articles, and identification of the closest 1 or N clusters to the target words, with their similarities, might be more reflective even if the population of articles is changing. Or, a plot of each year's full set of articles as a histogram of similarities to the target words, which might show a 'lump' of individual articles (not losing their distinctiveness to a full-year average) developing, over time, closer to the new phenomenon.
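For example, here is a rough sketch of that per-year histogram idea, reusing d2vmodel, nrm, covid and df['datetime'] from the question; the averaging of the keyword vectors into one target vector and the two years chosen are my own illustrative assumptions:
import numpy as np
import matplotlib.pyplot as plt
# one target vector as the mean of the normalized keyword vectors (an assumption)
target = nrm(np.mean([nrm(d2vmodel.wv[w]) for w in covid], axis=0))
years = df['datetime'].dt.year.values
for year in (2019, 2020):
    sims = [nrm(d2vmodel.docvecs[i]) @ target
            for i in range(len(years)) if years[i] == year]
    plt.hist(sims, bins=50, alpha=0.5, label=str(year))
plt.xlabel('cosine similarity to covid keywords')
plt.ylabel('number of articles')
plt.legend()
plt.show()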
Turns out that setting parameters to dm=0, dbow_words=1 allows for training documents and words in the same space, now yielding valid results.
I need to generate an embedding matrix to use instead of the layer. I know a priori the similarity between the 10 features (all equidistant from each other) and I can't generate the matrices through training because I don't have enough data.
To do this I have to generate 10 vectors of arbitrary but equal size (e.g. 10) that are all equidistant from each other, with the values of the individual dimensions being numbers between -1 and 1, all of this in Python.
Anyone know how this can be done?
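One simple construction that happens to satisfy these constraints (a sketch of my own, not from the question): the rows of an identity matrix are mutually equidistant, and their entries stay within [-1, 1].
import numpy as np
from scipy.spatial.distance import pdist
n_features, dim = 10, 10
# rows of the identity matrix: every pair of distinct rows is at distance sqrt(2),
# and all entries are 0 or 1, i.e. within [-1, 1]
embedding_matrix = np.eye(n_features, dim)
# optional: center the rows (entries become -0.1 or 0.9, still within [-1, 1]);
# a common shift does not change the pairwise distances
embedding_matrix -= embedding_matrix.mean(axis=0)
print(np.round(pdist(embedding_matrix), 6))  # all pairwise distances are equal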
I believe you have some words as features and you want to represent them as embedding vectors.
There are several ways to create word embeddings; I will mention a few of them, from the simplest to more complex yet very powerful methods.
1. Count Vector.
It is a method of creating vectors out of your unique tokens. For example, if the vocabulary contains three words, say ["and", "basketball", "more"], then the text "more and more" will be mapped to the vector [1, 0, 2]: the word "and" appears once, the word "basketball" does not appear at all, and the word "more" appears twice. This text representation is called a bag of words, since it completely loses the order of the words.
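A minimal sketch of that count vector with sklearn; passing the vocabulary explicitly is my own choice so the columns match the example:
from sklearn.feature_extraction.text import CountVectorizer
# fix the vocabulary so the column order matches ["and", "basketball", "more"]
vectorizer = CountVectorizer(vocabulary=["and", "basketball", "more"])
bow = vectorizer.fit_transform(["more and more"])
print(bow.toarray())  # [[1 0 2]]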
2. TF-IDF (Term Frequency - Inverse Document Frequency)
The problem with the count vector is that it downplays an important word when that word appears less often than common words. In the above example the term "basketball" is effectively ignored and "more" is given importance. The TF-IDF approach is best suited to overcome this. For example, let's imagine that the words "and", "basketball", and "more" appear respectively in 200, 10, and 100 text instances in the training set: in this case, the final vector will be [1/log(200), 0/log(10), 2/log(100)], which is approximately equal to [0.19, 0., 0.43].
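A quick check of that arithmetic; note this uses the simplified count/log(document frequency) weighting from the example above, not sklearn's exact TfidfVectorizer formula:
import numpy as np
counts = np.array([1, 0, 2])         # "and", "basketball", "more" in "more and more"
doc_freq = np.array([200, 10, 100])  # training documents containing each word
tfidf = counts / np.log(doc_freq)
print(np.round(tfidf, 2))            # [0.19 0.   0.43]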
3. Pre-trained word vectors.
These are embedding vectors trained on millions of text documents from Wikipedia or other general sources, so they cover most common English terms. Many open-sourced pre-trained word vectors are available; some of them are:
GoogleNews vector.
GloVe
fastText by Facebook.
You can choose the vector dimension based on the model's availability; for example, you can choose 50-, 100-, 200- or 300-dimensional vectors for each word.
from gensim.models import KeyedVectors
# loading the downloaded model
# (recent gensim removed the old Word2Vec.load_word2vec_format call; KeyedVectors is used instead)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# the model is loaded; it can be used to perform all of the tasks mentioned above
# getting the word vector of a word
dog = model['dog']
For more details and other methods to create word embeddings, you can refer to this beautiful article written by NCC.
Hope this answers your question, Happy Learning!
I have a question about word2vec and word embeddings. I have downloaded GloVe pre-trained word embeddings (shape 40,000 x 50) and I am using this function to extract information from them:
import numpy as np
def loadGloveModel(gloveFile):
    print("Loading Glove Model")
    f = open(gloveFile, 'r')
    model = {}
    for line in f:
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print("Done.", len(model), " words loaded!")
    return model
Now if I call this function for word 'python' something like :
print(loadGloveModel('glove.6B.100d.txt')['python'])
it gives me a vector like this (one float per embedding dimension):
[ 0.24934 0.68318 -0.044711 -1.3842 -0.0073079 0.651
-0.33958 -0.19785 -0.33925 0.26691 -0.033062 0.15915
0.89547 0.53999 -0.55817 0.46245 0.36722 0.1889
0.83189 0.81421 -0.11835 -0.53463 0.24158 -0.038864
1.1907 0.79353 -0.12308 0.6642 -0.77619 -0.45713
-1.054 -0.20557 -0.13296 0.12239 0.88458 1.024
0.32288 0.82105 -0.069367 0.024211 -0.51418 0.8727
0.25759 0.91526 -0.64221 0.041159 -0.60208 0.54631
0.66076 0.19796 -1.1393 0.79514 0.45966 -0.18463
-0.64131 -0.24929 -0.40194 -0.50786 0.80579 0.53365
0.52732 0.39247 -0.29884 0.009585 0.99953 -0.061279
0.71936 0.32901 -0.052772 0.67135 -0.80251 -0.25789
0.49615 0.48081 -0.68403 -0.012239 0.048201 0.29461
0.20614 0.33556 -0.64167 -0.64708 0.13377 -0.12574
-0.46382 1.3878 0.95636 -0.067869 -0.0017411 0.52965
0.45668 0.61041 -0.11514 0.42627 0.17342 -0.7995
-0.24502 -0.60886 -0.38469 -0.4797 ]
I need help understanding this output. What do these values represent, and what is their significance?
In usual word2vec/GloVe, the individual per-dimension coordinates don't specifically mean anything. The training process instead forces words into valuable/interesting relative positions against each other.
All meaning is in the relative distances and relative directions, not specifically aligned with exact coordinate axes.
Consider a classic illustrative example: the ability of word-vectors to solve an analogy like "man is to king as woman is to ?", by finding the word queen near some expected point in the coordinate space.
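A minimal sketch of that analogy with gensim, assuming a set of pretrained vectors is available (the GoogleNews file name here is just an example):
from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# vector arithmetic: king - man + woman is expected to land near queen
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))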
There will be neighborhoods of the word-vector space that include lots of related words of one type (man, men, male, boy, etc. - or king, queen, prince, royal, etc.). And further, there may be some directions that match well with human ideas of categories and magnitude (more woman-like, more-monarchical, higher-ranked, etc.). But these neighborhoods and directions generally are not 1:1 correlated with exact axis-dimensions of the space.
And further, there are many possible near rotations/reflections/transformations of a space full of word-vectors which are just as good as each other for typical applications, but totally different in their exact coordinates for each word. That is, all the expected relative distances are similar (words have the 'right' neighbors, in the right ranked order) and there are useful directional patterns. But the individual words in each have no globally 'right' or consistent position, just relatively useful positions.
Even if in one set of vectors there appears to be some vague correlation, like "high values in dimension 21 correlate with the idea of maleness", it's likely to be a coincidence of that vector set, not a reliable relationship.
(There are some alternate techniques which try to force the individual dimensions to be mapped to more interpretable concepts; see NNSE as one example. But their use seems less common.)
Here is a nice article explaining the underlying intuition and meaning of word2vec vectors.
https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
There's no universal way to know exactly what an embedding value means. The results discussed in that article were discovered by looking at many embeddings where one value is varied and noting the differences. Each word2vec model will come up with its own unique embedding. The individual values of the embedding have some semantic meaning in the language.
What word2vec gives you is a conversion from a sparse one-hot vector, representing each word in a dictionary of potentially millions of words, into a small dense vector where each value has some semantic meaning in the language. Large sparse inputs are usually bad for learning; small, dense, meaningful inputs are usually good.
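To make that size difference concrete (the vocabulary size and embedding dimension below are just illustrative numbers):
import numpy as np
vocab_size, embedding_dim = 1_000_000, 300
# one-hot: a million entries, all zero except one
one_hot = np.zeros(vocab_size)
one_hot[42] = 1.0
# word2vec-style dense embedding: a few hundred informative floats
dense = np.random.randn(embedding_dim)
print(one_hot.shape, dense.shape)  # (1000000,) (300,)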
In a nutshell, a word vector in a word embedding represents the word's contexts. It "embeds" the meaning because "similar words have similar contexts". Furthermore, you can extend this idea to "whatever embedding": just train a neural network with a lot of context of something (sentences, paragraphs, documents, images and so on) and the resulting vector of dimension d will contain a valuable representation of your objects.
This is a good post to get a complete overview:
https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
I know that Term-Document Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
I am using sklearn's CountVectorizer to extract features from strings (a text file) to ease my task. The following code returns a term-document matrix according to the sklearn documentation:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
content = ["how to format my hard disk", "hard disk format problems"]
X = vectorizer.fit_transform(content) #X is Term-document matrix
print(X)
The output is as follows. I am not getting how this matrix has been calculated; please discuss the example shown in the code. I have read one more example on Wikipedia but could not understand it.
The output of a CountVectorizer().fit_transform() is a sparse matrix. It means that it will only store the non-zero elements of a matrix. When you do print(X), only the non-zero entries are displayed as you observe in the image.
As for how the calculation is done, you can have a look at the official documentation here.
The CountVectorizer, in its default configuration, tokenizes the given document or raw text (it only keeps terms that have 2 or more characters) and counts the word occurrences.
Basically, the steps are as follows:
Step 1 - Collect all the different terms from all the documents passed to fit().
For your data, they are
[u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']
This is available from vectorizer.get_feature_names()
Step 2 - In transform(), count the occurrences of each term from the fit() vocabulary in each document and output the term-frequency matrix.
In your case, you are supplying both documents to fit_transform() (which is a shorthand for fit() followed by transform()). So, the result is
        disk  format  hard  how  my  problems  to
First    1      1      1     1    1      0      1
Sec      0      1      1     0    0      1      0
You can get the above result by calling X.toarray().
In the output of print(X) you posted, each entry shows the (row, column) index into the term-frequency matrix followed by the frequency of that term in that document.
<0, 0> means first row, first column, i.e. the frequency of the term "disk" (first term in our tokens) in the first document = 1
<0, 2> means first row, third column, i.e. the frequency of the term "hard" (third term in our tokens) in the first document = 1
<0, 5> would mean first row, sixth column, i.e. the frequency of the term "problems" (sixth term in our tokens) in the first document = 0. But since it is 0, it is not displayed in your output.
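A quick way to check that mapping yourself, reusing X and vectorizer from the question's code (note: newer scikit-learn versions replaced get_feature_names() with get_feature_names_out()):
import pandas as pd
# dense term-document matrix with the vocabulary as column labels
df_counts = pd.DataFrame(X.toarray(),
                         columns=vectorizer.get_feature_names_out(),
                         index=["First", "Sec"])
print(df_counts)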
I found the code in this post to be very helpful. (I would add a comment to that post but I need 50 reputation points.)
I used the same code in the post above but added a test document I have been using for debugging my own clustering code. For some reason a word in 1 document appears in both clusters.
The code is:
Update: I added "Unique sentence" to the documents below.
documents = ["I ran yesterday.",
"The sun was hot.",
"I ran yesterday in the hot and humid sun.",
"Yesterday the sun was hot.",
"Yesterday I ran in the hot sun.",
"Unique sentence." ]
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
#cluster documents
true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
#print top terms per cluster clusters
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()
The output I receive is:
UPDATE: I updated the output below to reflect the "unique sentence" above.
Cluster 0:
sun
hot
yesterday
ran
humid
unique
sentence
Cluster 1:
unique
sentence
yesterday
sun
ran
humid
hot
You'll note that "humid" appears as a top term in both clusters even though it's in just 1 line of the documents above. I would expect a unique word, like "humid" in this case, to be a top term in just 1 of the clusters.
Thanks!
TF*IDF tells you the representativeness of a word (in this case the column) for a specific document (and in this case the row). By representative I mean: a word occurs frequently in one document but not frequently in other documents. The higher the TF*IDF value, the more this word represents a specific document.
Now let's look at the values you actually work with. From sklearn's KMeans you use the attribute cluster_centers_. This gives you the coordinates of each cluster center, which are an array of TF*IDF weights, one per word. It is important to note that these are just some abstract form of word frequency and no longer relate back to a specific document. Next, numpy.argsort() gives you the indices that would sort an array, starting with the index of the lowest TF*IDF value, so you reverse its order with [:, ::-1]. Now you have the indices of the most representative words for that cluster center at the beginning.
Now, let's talk a bit more about k-means. k-means initialises its k cluster centers randomly. Then each document is assigned to a center, and the cluster centers are recomputed. This is repeated until the optimization criterion, minimizing the sum of squared distances between documents and their closest center, is met. What this means for you is that each cluster dimension most likely doesn't have a TF*IDF value of 0, because of the random initialisation. Furthermore, k-means stops as soon as the optimization criterion is met. Thus, the TF*IDF values of a center only mean that the TF*IDF vectors of the documents assigned to that cluster are closer to this center than to the other cluster centers.
One additional bit: with order_centroids[i, :10], the 10 most representative words for each cluster are printed, but since your vocabulary has fewer than 10 terms in total, all words will be printed for both clusters either way, just in a different order.
I hope this helped. By the way, k-means does not guarantee that you will find the global optimum and might get stuck in a local optimum; that's why it is usually run multiple times with different random starting points.
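A small sketch of how you might inspect the centroid weights directly, reusing X and vectorizer from the question's code; n_init=10 is my own choice to average out the random initialisation:
import numpy as np
from sklearn.cluster import KMeans
model = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10)
model.fit(X)
terms = vectorizer.get_feature_names_out()  # get_feature_names() on older sklearn
for i, center in enumerate(model.cluster_centers_):
    weights = {terms[j]: round(center[j], 3) for j in np.argsort(center)[::-1]}
    print("Cluster %d:" % i, weights)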
Not necessarily. The code you are using creates a vector space from the bag of words of your corpus, excluding stop words (I am ignoring the tf-idf weighting). Looking at your documents, your vector space is of size 5, with a word array like (ignoring the order):
word_vec_space = [yesterday, ran, sun, hot, humid]
Each document is assigned a numeric vector of whether it contains the words in 'word_vec_space'.
"I ran yesterday." -> [1,1,0,0,0]
"The sun was hot." -> [0,0,1,1,0]
...
When performing k-means clustering, you pick k starting points in the vector space and allow the points to move around to optimize the clusters. You ended up with both cluster centroids containing a non-zero value for 'humid'. This is because the one sentence that contains 'humid' also contains 'sun', 'hot', and 'yesterday'.
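For reference, a sketch of those document vectors, using CountVectorizer with the fixed vocabulary above so the columns match the word array (binary presence/absence, ignoring tf-idf as this answer does):
from sklearn.feature_extraction.text import CountVectorizer
word_vec_space = ['yesterday', 'ran', 'sun', 'hot', 'humid']
vec = CountVectorizer(vocabulary=word_vec_space, binary=True)
print(vec.fit_transform(documents).toarray())  # one 0/1 row per document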
Why would clusters have distinct top terms?
Assuming the clustering worked (very often it doesn't, so beware), would you consider these clusters to be bad or good:
banana fruit
apple fruit
apple computer
windows computer
window blinds
If I would ever get such clusters, I would be happy (in fact, I would suspect I am seeing an error, because these are much too good results; text clustering is always borderline to non-working).
With text clusters, it's a lot about word combinations, not just single words. Apple fruit and apple computer are not the same.