how to calculate term-document matrix? - python

I know that a term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
I am using sklearn's CountVectorizer to extract features from strings (a text file) to ease my task. The following code returns a term-document matrix, according to the sklearn documentation:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
content = ["how to format my hard disk", "hard disk format problems"]
X = vectorizer.fit_transform(content) #X is Term-document matrix
print(X)
The output is as follows (posted as an image). I am not getting how this matrix has been calculated. Please explain it using the example shown in the code. I have also read the example on Wikipedia but could not understand it.

The output of CountVectorizer().fit_transform() is a sparse matrix, which means that only the non-zero elements of the matrix are stored. When you do print(X), only the non-zero entries are displayed, as you observe in the image.
As for how the calculation is done, you can have a look at the official documentation here.
In its default configuration, the CountVectorizer tokenizes the given document or raw text (it keeps only terms that have 2 or more characters) and counts the word occurrences.
Basically, the steps are as follows:
Step 1 - Collect all the distinct terms from all the documents supplied to fit().
For your data, they are
[u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']
This is available from vectorizer.get_feature_names()
Step 2 - In transform(), count how many times each term found during fit() occurs in each document, and output this as the term-frequency matrix.
In your case, you are supplying both documents to transform() (fit_transform() is a shorthand for fit() and then transform()). So, the result is
[u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']
First    1  1  1  1  1  0  1
Second   1  1  1  0  0  1  0
You can get the above result by calling X.toarray().
In the image of print(X) you posted, the first column shows the (row, column) index into the term-frequency matrix and the second column shows the frequency of that term.
<0,0> means first row, first column, i.e. the frequency of the term "disk" (first term in our tokens) in the first document = 1.
<0,2> means first row, third column, i.e. the frequency of the term "hard" (third term in our tokens) in the first document = 1.
<0,5> would mean first row, sixth column, i.e. the frequency of the term "problems" (sixth term in our tokens) in the first document = 0. But since it is 0, it is not displayed in your image.
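To verify the two steps in code, here is a minimal reproduction (note: in recent scikit-learn versions get_feature_names() has been replaced by get_feature_names_out(); the expected output is shown in the comments):
from sklearn.feature_extraction.text import CountVectorizer

content = ["how to format my hard disk", "hard disk format problems"]
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(content)      # sparse term-document matrix

print(vectorizer.get_feature_names_out())  # vocabulary collected during fit()
# ['disk' 'format' 'hard' 'how' 'my' 'problems' 'to']

print(X.toarray())                         # dense term-frequency matrix
# [[1 1 1 1 1 0 1]
#  [1 1 1 0 0 1 0]]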

Related

How can I get a fixed size vector from Byte Pair Encoding (using pre-trained BPEmb)?

I want to get vectors for the words of a document using pre-trained BPEmb. In a paper, the author used this API to break words into a maximum of 3 subwords, got a 100-dimensional embedding vector for each subword, and finally had a 300-dimensional vector for each word. Here I have two questions. How can I define the maximum number of subwords in BPEmb? The API decides the number of tokens itself, and I could not find out how the author set a maximum. The second question: when there are fewer than 3 subwords, how can I make the output a 300-dimensional vector for each word? For example, if a word has 2 subwords, it produces a 200-dimensional vector.
I have doubts about padding, since it might change the vector of the word to something with a different meaning. Here is a simple piece of code just to show the API I used.
from bpemb import BPEmb
bpemb_en = BPEmb(lang='en', vs=200000)
print(bpemb_en.embed('Something'))
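One possible workaround (a sketch only, assuming that truncating/zero-padding to exactly 3 subwords is acceptable, which should be checked against the paper; word_vector is a hypothetical helper name and dim=100 is the assumed embedding size): truncate the subword embeddings to 3 and zero-pad when fewer are returned, then concatenate.
import numpy as np
from bpemb import BPEmb

bpemb_en = BPEmb(lang='en', vs=200000, dim=100)

def word_vector(word, max_subwords=3):
    sub_vecs = bpemb_en.embed(word)[:max_subwords]   # (n_subwords, 100), truncated to 3 rows
    if sub_vecs.shape[0] < max_subwords:             # fewer than 3 subwords: zero-pad
        pad = np.zeros((max_subwords - sub_vecs.shape[0], sub_vecs.shape[1]))
        sub_vecs = np.vstack([sub_vecs, pad])
    return sub_vecs.reshape(-1)                      # concatenate into one 300-dim vector

print(word_vector('Something').shape)  # (300,)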

Naive Bayes, Text Analysis, SKLearn

This is from a text analysis exercise using data from Rotten Tomatoes. The data is in critics.csv, imported as a pandas DataFrame, "critics".
This piece of the exercise is to
Construct the cumulative distribution of document frequencies (df). The x-axis is a document count (x_i) and the y-axis is the percentage of words that appear less than (x_i) times. For example, at x = 5, plot a point representing the percentage or number of words that appear in 5 or fewer documents.
From a previous exercise, I have a "Bag of Words"
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
# build the vocabulary and transform to a "bag of words"
X = vectorizer.fit_transform(critics.quote)
# Convert matrix to Compressed Sparse Column (CSC) format
X = X.tocsc()
Every sample I've found calculates a matrix of documents per word from that "bag of words" matrix in this way:
docs_per_word = X.sum(axis=0)
I buy that this works; I've looked at the result.
But I'm confused about what's actually happening and why it works, what is being summed, and how I might have been able to figure out how to do this without needing to look up what other people did.
I figured this out last night. It doesn't actually work; I misinterpreted the result. (I thought it was working because Jupyter notebooks only show a few values in a large array. But, examined more closely, the array values were too big. The max value in the array was larger than the number of 'documents'!)
X (my "bag of words) is a word frequency vector. Summing over X provides information on how often each word occurs within the corpus of documents. But the instructions as for how many documents a word appears in (e.g. between 0 and 4 for four documents), not how many times it appears in the set of those documents (0 - n for four documents).
I need to convert X to a boolean matrix. (Now I just have to figure out how to do this. ;-)
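One way to finish this (a sketch of my own, assuming X is the scipy sparse bag-of-words matrix built above): compare the entries of X to zero to get a boolean matrix, sum over the document axis to get each word's document frequency, and then compute the cumulative fraction.
import numpy as np

bool_X = (X > 0)                                        # True where a word occurs in a document
docs_per_word = np.asarray(bool_X.sum(axis=0)).ravel()  # number of documents containing each word

# fraction of words that appear in x or fewer documents, for x = 0 .. max
xs = np.arange(docs_per_word.max() + 1)
cdf = [(docs_per_word <= x).mean() for x in xs]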

How to get word count from TF*IDF value in sklearn

I want to get the count of a word in a given sentence using only tf*idf matrix of a set of sentences. I use TfidfVectorizer from sklearn.feature_extraction.text.
Example :
from sklearn.feature_extraction.text import TfidfVectorizer
sentences = ("The sun is shiny i like the sun","I have been exposed to sun")
vect = TfidfVectorizer(stop_words="english",lowercase=False)
tfidf_matrix = vect.fit_transform(sentences).toarray()
I want to be able to calculate the number of times the term "sun" occurs in the first sentence (which is 2) using only tfidf_matrix[0] and probably vect.idf_ .
I know there are infinite ways to get term frequency and words count but I have a special case where I only have a tfidf matrix.
I already tried to divide the tfidf value of the word "sun" in the first sentence by its idf value to get tf. Then I multiplied tf by the total number of words in the sentence to get the words count. Unfortunately, I get wrong values.
The intuitive thing to do would be exactly what you tried: multiply each tf value by the number of words in the sentence you're examining. However, I think the key observation here is that each row has been normalized by its euclidean length. So multiplying each row by the number of words in that sentence is at best approximating the denormalized row, which is why you get weird values. AFAIK, you can't denormalize the tf*idf matrix without knowing the norms of each of the original rows ahead of time. This is primarily because there are an infinite number of vectors that can be mapped to any one normalized vector. So without the norms, you can't retrieve the correct magnitude of the original vector. See this answer for more details about what I mean.
That being said, I think there's a workaround in our case. We can at least retrieve the normalized ratios of the term counts in each sentence, i.e., sun appears twice as much as shiny. I found that normalizing each row so that the sum of the tf values is 1 and then multiplying those values by the length of the stopword-filtered sentences seems to retrieve the original word counts.
To demonstrate:
sentences = ("The sun is shiny i like the sun","I have been exposed to sun")
vect = TfidfVectorizer(stop_words="english",lowercase=False)
mat = vect.fit_transform(sentences).toarray()
q = mat / vect.idf_
sums = np.ones((q.shape[0], 1))
lens = np.ones((q.shape[0], 1))
for ix in xrange(q.shape[0]):
sums[ix] = np.sum(q[ix,:])
lens[ix] = len([x for x in sentences[ix].split() if unicode(x) in vect.get_feature_names()]) #have to filter out stopwords
sum_to_1 = q / sums
tf = sum_to_1 * lens
print tf
yields:
[[ 1. 0. 1. 1. 2.]
[ 0. 1. 0. 0. 1.]]
I tried this with a few more complicated sentences and it seems to work alright. Let me know if I missed anything.

How does kmeans know how to cluster documents when we only feed it tfidf vectors of individual words?

I am using scikit-learn's KMeans algorithm to cluster comments.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentence_list = ['hello how are you', "I am doing great", "my name is abc"]
num_clusters = 2  # example value
vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)
km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)
When I print the output of vectorized, it gives me the indices of the words and the tf-idf scores for those indices.
So I'm wondering: given that we only get the tf-idf scores of individual words, how do we manage to cluster documents based on individual words and not on a score for the entire document? Or maybe it does do this... Can someone explain the concept behind this to me?
You should take a look at how the KMeans algorithm works. First, the stop words never make it into vectorized, so they are completely ignored by KMeans and have no influence on how the documents are clustered. Now suppose you have:
sentence_list=["word1", "word2", "word2 word3"]
Let's say you want 2 clusters. In this case you expect the second and the third documents to be in the same cluster because they share a common word. Let's see how this happens.
The numeric representation of the docs in vectorized looks like this:
   word1     word2     word3
   1.000000  0.000000  0.000000   # doc 1
   0.000000  1.000000  0.000000   # doc 2
   0.000000  0.605349  0.795961   # doc 3
In the first step of KMeans, some centroids are chosen randomly from the data; suppose, for example, that document 1 and document 3 are picked as the initial centroids:
Centroid 1: [1, 0.000000, 0.000000]
Centroid 2: [0, 0.605349, 0.795961]
Now if you calculate the distances from every point (document) to each of the two centroids, you will see that:
document 1 has distance 0 to centroid 1 so it belongs to centroid 1
document 3 has distance 0 to centroid 2 so it belongs to centroid 2
Finally we calculate the distance between the remaining document 2 and each centroid to find out which one it belongs to:
>>> from scipy.spatial.distance import euclidean
>>> euclidean([0, 1, 0], [1, 0, 0]) # dist(doc2, centroid1)
1.4142135623730951
>>> euclidean([0, 1, 0], [0, 0.605349, 0.795961]) # dist(doc2, centroid2)
0.8884272507056005
So the 2nd document is closer to the second centroid, which means that the second document is assigned to the 2nd centroid.
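To check the walkthrough end to end, here is a small sketch of my own (the label numbers themselves are arbitrary; what matters is which documents share a label):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["word1", "word2", "word2 word3"]
vectorized = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectorized)
print(km.labels_)   # e.g. [1 0 0] -- doc 2 and doc 3 end up in the same cluster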
TF/IDF is a measure that calculates the importance of a word in a document with respect to the rest of the words in that document. It does not compute the importance of a standalone word (and that makes sense, right? Importance always means privilege over others!). So the TF/IDF of each word is really a measure of a document's importance with respect to that word.
I don't see where TF/IDF is used in your code. However, it is possible to run the k-means algorithm with TF/IDF scores used as features. Also, clustering the three sample documents you have mentioned is simply impossible, since no two of them share a common word!
Edit 1: First of all, if the word 'cat' occurs in two documents, it is possible that they will be clustered together (depending on the other words in those two documents and also on the other documents). Secondly, you should learn more about k-means: it uses features to cluster documents together, and the tf/idf score of each word in a document is a feature used to compare that document against the others in the corpus.

Is the idf for query same as idf for documents?

This is part of my code.
idf = self.getInverseDocFre(word)            # idf of the query term, computed from the collection
qi = count * idf                             # tf-idf weight of the term in the query
di = self.docTermCount[docid][word] * idf    # tf-idf weight of the term in the document
similiarity += qi * di                       # accumulate the dot product over the query terms
similiarity /= self.docSize[docid]           # normalize by the document size
This is the Wikipedia article:
https://en.wikipedia.org/wiki/Vector_space_model#Example:_tf-idf_weights
This is an example from the web:
http://www.site.uottawa.ca/~diana/csi4107/cosine_tf_idf_example.pdf
My question is: is the idf for the query the same idf computed from the collection? Is that why the idf ends up being multiplied in twice in the similarity? I am afraid I have the wrong idea about the concept of idf for the query part.
You have to represent your query in the same space as the documents of your collection, i.e. the transformation words -> vectors has to be the same for both the documents and the query; otherwise you would be comparing apples to oranges. This transformation is fixed once you have extracted the terms and calculated the IDFs from the collection. Once you have it, you can map new documents (and queries) into this representation.
Imagine that your query is exactly one of your documents (d2, for example):
d2 = [0 0 0.584 1.584 0 0.584] # new york post
query = [0 0 1 1 0 1] # new york post
In this case you expect the similarity to be one. There is no way this is going to happen if you don't multiply the query TFs by the corresponding IDFs (which you got from the collection). A vector that has only counts (term frequencies) is not going to be parallel to a vector whose components have each been multiplied by the corresponding idf (except in the special case where all the idfs are equal). That is why you have to multiply the query too: the documents have already been multiplied.
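As a small illustration (my own sketch, reusing the d2 vector from above with hypothetical idf values; the idfs of the unused terms do not matter because their counts are zero): the cosine similarity between the raw-count query and d2 is below 1, but weighting the query with the same collection idfs makes it exactly parallel to d2.
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

idf   = np.array([2.0, 1.0, 0.584, 1.584, 1.0, 0.584])  # hypothetical collection idfs
d2    = np.array([0, 0, 1, 1, 0, 1]) * idf              # document "new york post": tf * idf
q_tf  = np.array([0, 0, 1, 1, 0, 1])                    # query "new york post": raw counts only
q_idf = q_tf * idf                                      # query weighted with the same idfs

print(cosine(q_tf,  d2))   # ~0.89: counts alone are not parallel to the tf-idf document
print(cosine(q_idf, d2))   # 1.0: weighting the query with the collection idfs fixes it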
