Is the IDF for the query the same as the IDF for the documents? - python

This is part of my code.
idf = self.getInverseDocFre(word)            # this idf comes from the collection
qi = count * idf                             # query weight for this term
di = self.docTermCount[docid][word] * idf    # document weight for this term
similarity += qi * di
similarity /= self.docSize[docid]
This is the Wikipedia article:
https://en.wikipedia.org/wiki/Vector_space_model#Example:_tf-idf_weights
This is an example from the web:
http://www.site.uottawa.ca/~diana/csi4107/cosine_tf_idf_example.pdf
My question is: is the IDF for the query the same IDF computed from the collection?
Is that why I have to multiply by the IDF twice in the similarity computation?
I am afraid that I am wrong about the concept of IDF for the query part.

You have to represent your query in the same space as the documents of your collection, i.e. the transformation from words to vectors has to be the same for both the documents and the query; otherwise you would be comparing apples to oranges. This transformation is fixed once you have extracted the terms and calculated the IDFs from the collection. Once you have it, you can represent new documents (including queries) in this representation.
Imagine that your query is exactly one of your documents (d2, for example):
d2 = [0 0 0.584 1.584 0 0.584] # new york post
query = [0 0 1 1 0 1] # new york post
In this case you expect the similarity to be one. There is no way this will happen if you don't multiply the query TFs by the corresponding IDFs (which you got from the collection). A vector that contains only counts (term frequencies) is not going to be parallel to a vector whose components have each been multiplied by the corresponding IDF (except in the special case where all IDFs are equal). That is why you have to multiply the query too: the documents have already been multiplied.
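To make that concrete, here is a minimal numpy sketch. The IDF values are made up (only the three non-zero components are chosen so that tf * idf reproduces the d2 vector above), so treat it as an illustration rather than the exact numbers from the linked example:

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Assumed collection IDFs for the 6 vocabulary terms (illustrative values only).
idf = np.array([2.0, 2.0, 0.584, 1.584, 2.0, 0.584])

d2_tf    = np.array([0, 0, 1, 1, 0, 1])   # term counts of document d2
query_tf = np.array([0, 0, 1, 1, 0, 1])   # the same text used as a query

d2 = d2_tf * idf                          # document representation: tf * idf

print(cosine(query_tf, d2))               # ~0.89: raw counts are not parallel to tf-idf weights
print(cosine(query_tf * idf, d2))         # 1.0: query weighted with the same collection IDFs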

Related

Naive Bayes, Text Analysis, SKLearn

This is from a text analysis exercise using data from Rotten Tomatoes. The data is in critics.csv, imported as a pandas DataFrame, "critics".
This piece of the exercise is to
Construct the cumulative distribution of document frequencies (df).
The x-axis is a document count x_i and the y-axis is the
percentage of words that appear in fewer than x_i documents. For example,
at x = 5, plot a point representing the percentage or number of words
that appear in 5 or fewer documents.
From a previous exercise, I have a "Bag of Words"
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
# build the vocabulary and transform to a "bag of words"
X = vectorizer.fit_transform(critics.quote)
# Convert matrix to Compressed Sparse Column (CSC) format
X = X.tocsc()
Every example I've found calculates the number of documents per word from that "bag of words" matrix in this way:
docs_per_word = X.sum(axis=0)
I buy that this works; I've looked at the result.
But I'm confused about what's actually happening and why it works, what is being summed, and how I might have been able to figure out how to do this without needing to look up what other people did.
I figured this out last night. It doesn't actually work; I misinterpreted the result. (I thought it was working because Jupyter notebooks only show a few values of a large array. But, examined more closely, the array values were too big: the max value in the array was larger than the number of 'documents'!)
X (my "bag of words") is a word-frequency matrix. Summing over X tells you how often each word occurs within the corpus of documents. But the instructions ask for how many documents a word appears in (e.g. between 0 and 4 for four documents), not how many times it appears in the set of those documents (0 to n for four documents).
I need to convert X to a boolean matrix. (Now I just have to figure out how to do this. ;-)
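For what it's worth, a minimal sketch of that conversion (assuming X is the CSC matrix from above) could look like this; (X > 0) replaces counts with True/False, so summing it over axis=0 counts documents per word instead of total occurrences:

import numpy as np

# True where a word occurs in a document at all, regardless of how many times.
docs_per_word = (X > 0).sum(axis=0)              # shape (1, n_words)
docs_per_word = np.asarray(docs_per_word).ravel()

# Cumulative distribution: fraction of words appearing in at most x documents.
counts = np.sort(docs_per_word)
fraction = np.arange(1, len(counts) + 1) / len(counts)
# plt.plot(counts, fraction) would then give the requested curve.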

tfidf outcomes are different for the exact same word

I'm running a tf-idf model in Python:
from gensim import corpora, models

texts = [**tokenized words**]
dictionary = corpora.Dictionary(texts)
corpus = list(map(dictionary.doc2bow, texts))
test_model = models.TfidfModel(corpus)
corpus_tfidf = test_model[corpus]
And it returns output that assigns a range of different values to the exact same word.
For example, I chose the word "AAA".
        key     score
0       "AAA"   1
2323    "AAA"   0.896502
4086    "AAA"   0.844922
Why do they have different values even though they are the exact same word?
TFIDF stands for term frequency - inverse document frequency. This means that, for every token in each document, a TFIDF vectorisation will first count the frequency of the token in the document. Then it will weight that term frequency inversely by the proportion of documents that also contain the token.
The result is that every token in each document will have a value that reflects its significance to that particular document, negatively weighted by its presence across all documents.
Some TFIDF processors may also add an extra dimension of weighting based on how many other tokens are in each document.
In short, the same token has different scores in different documents because that token is probably more prevalent in some documents than in others. This prevalence comes either from the token being more frequent or from it accounting for a larger proportion of the document's tokens.
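A small gensim sketch (with a made-up toy corpus, not your data) shows the effect; the token "aaa" appears in two documents but gets a different normalized weight in each:

from gensim import corpora, models

texts = [["aaa"],
         ["bbb", "ccc"],
         ["aaa", "bbb", "bbb"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
tfidf = models.TfidfModel(corpus)

for doc in tfidf[corpus]:
    print([(dictionary[term_id], round(weight, 3)) for term_id, weight in doc])
# "aaa" scores 1.0 in the first document (it is the only term there)
# but only about 0.45 in the third, where "bbb" dominates the counts.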

how to calculate term-document matrix?

I know that Term-Document Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
I am using sklearn's CountVectorizer to extract features from strings (text files) to ease my task. The following code returns a term-document matrix, according to the sklearn documentation:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
content = ["how to format my hard disk", "hard disk format problems"]
X = vectorizer.fit_transform(content) #X is Term-document matrix
print(X)
I am not getting how this matrix has been calculated from the printed output. Please discuss the example shown in the code. I have read one more example from Wikipedia but could not understand it.
The output of CountVectorizer().fit_transform() is a sparse matrix. This means that it only stores the non-zero elements of the matrix. When you do print(X), only the non-zero entries are displayed, as you observed.
As for how the calculation is done, you can have a look at the official documentation here.
The CountVectorizer, in its default configuration, tokenizes the given documents or raw text (it keeps only tokens that have 2 or more characters) and counts the word occurrences.
Basically, the steps are as follows:
Step 1 - Collect all the distinct terms from all the documents passed to fit().
For your data, they are
[u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']
This is available from vectorizer.get_feature_names()
Step 2 - In transform(), count how many times each term found during fit() occurs in each document, and output that as the term-frequency matrix.
In your case, you are supplying both documents to transform() (fit_transform() is shorthand for fit() followed by transform()). So the result is:
          disk  format  hard  how  my  problems  to
First      1      1      1     1    1     0       1
Second     1      1      1     0    0     1       0
You can get the above result by calling X.toarray().
In the print(X) output you posted, the first column represents the (row, column) index into the term-frequency matrix and the second represents the frequency of that term.
<0,0> means first row, first column, i.e. the frequency of the term "disk" (first term in our tokens) in the first document = 1
<0,2> means first row, third column, i.e. the frequency of the term "hard" (third term in our tokens) in the first document = 1
<0,5> means first row, sixth column, i.e. the frequency of the term "problems" (sixth term in our tokens) in the first document = 0. But since it is 0, it is not displayed in the output.
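If it helps, here is a small sketch that prints the dense term-document matrix with the terms as column labels (pandas is only used for display, and recent scikit-learn versions name the method get_feature_names_out() instead of get_feature_names()):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

content = ["how to format my hard disk", "hard disk format problems"]
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(content)

terms = vectorizer.get_feature_names_out()   # use get_feature_names() on older versions
print(pd.DataFrame(X.toarray(), columns=terms, index=["First", "Second"]))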

How does kmeans know how to cluster documents when we only feed it tfidf vectors of individual words?

I am using scikit-learn's KMeans algorithm to cluster comments.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentence_list = ['hello how are you', "I am doing great", "my name is abc"]
vectorizer = TfidfVectorizer(min_df=1, max_df=0.9, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(sentence_list)
km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, verbose=1)
km.fit(vectorized)
When I print the output of vectorized, it gives me the indices of the words and the tf-idf scores at those indices.
So I'm wondering: given that we only get the tf-idf scores of words, how do we manage to cluster documents based on individual words and not on a score for an entire document? Or maybe it does do this. Can someone explain the concept behind this?
You should take a look at how the k-means algorithm works. First, the stop words never make it into vectorized, so they are completely ignored by k-means and have no influence on how the documents are clustered. Now suppose you have:
sentence_list=["word1", "word2", "word2 word3"]
Let's say you want 2 clusters. In this case, you expect the second and the third documents to be in the same cluster because they share a common word. Let's see how this happens.
The numeric representation of the docs in vectorized looks like:
word1     word2     word3
1.000000  0.000000  0.000000   # doc 1
0.000000  1.000000  0.000000   # doc 2
0.000000  0.605349  0.795961   # doc 3
In the first step of k-means, some centroids are chosen randomly from the data; say, for example, that document 1 and document 3 are picked as the initial centroids:
Centroid 1: [1, 0.000000, 0.000000]
Centroid 2: [0, 0.605349, 0.795961]
Now if you calculate the distance from every point (document) to each of the two centroids, you will see that:
document 1 has distance 0 to centroid 1 so it belongs to centroid 1
document 3 has distance 0 to centroid 2 so it belongs to centroid 2
Finally we calculate the distance between the remaining document 2 and each centroid to find out which one it belongs to:
>>> from scipy.spatial.distance import euclidean
>>> euclidean([0, 1, 0], [1, 0, 0]) # dist(doc2, centroid1)
1.4142135623730951
>>> euclidean([0, 1, 0], [0, 0.605349, 0.795961]) # dist(doc2, centroid2)
0.8884272507056005
So the second document and the second centroid are closer, which means that the second document is assigned to the second centroid.
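Here is a minimal sketch that reproduces this toy example end to end (random_state and n_init are choices added here for reproducibility, and get_feature_names_out assumes a recent scikit-learn):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["word1", "word2", "word2 word3"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # ['word1' 'word2' 'word3']
print(X.toarray().round(6))                 # the matrix shown above

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)                               # expect docs 2 and 3 to share a label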
TF/IDF is a measure that calculates the importance of a word in a document with respect to the rest of the words in that document. It does not compute the importance of a standalone word (and that makes sense, right? Because importance always means privilege over others!). So the TF/IDF of each word is actually an importance measure of a document with respect to that word.
I don't see where TF/IDF is used in your code. However, it is possible to run the k-means algorithm with TF/IDF scores used as features. Also, clustering the three sample documents you mentioned is simply impossible, since no two of them share a common (non-stop) word!
Edit 1: First of all, if the word 'cat' occurs in two documents, it is possible that they would be clustered together (depending on the other words in the two documents and also on the other documents). Secondly, you should learn more about k-means. You see, k-means uses features to cluster documents together, and the tf/idf score of each word in a document is a feature value used to compare that document against the others in the corpus.

Get similarity percent with sklearn hashing vectorizer

I have a Python program that fetches articles from a few sites and stores them in a database. When I want to add a new article to the database, I should check that it is not a duplicate. I want to do this simply by computing a percentage of similarity and setting a threshold for it (for example, if the similarity between the two strings is > 70%, the new article is a duplicate).
My problem is finding that percentage of similarity. Right now I use difflib and the SequenceMatcher class:
diff = SequenceMatcher(
    None, article1.content, article2.content).ratio()
But it's not right, and I think using HashingVectorizer would be better for this case(?):
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=(2**18))
article1_vector = vectorizer.transform([article1.content])
article2_vector = vectorizer.transform([article2.content])
How can I get the percentage of similarity between two hashed vectors (for example, with cosine distance), and how can I convert it to a percentage? Thanks for your answers.
With the default settings for HashingVectorizer (in particular, norm="l2"), the cosine similarity between these two vectors is
sim = (article1_vector * article2_vector.T).A[0, 0]
This is really just a dot product with some trickery to get rid of the SciPy sparse matrix format.
This gives a similarity between -1 and 1, so you could add one and divide by two to map it into [0, 1], then multiply by 100 to get a percentage.
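Putting it together, a small sketch (with placeholder strings standing in for article1.content and article2.content) might look like:

from sklearn.feature_extraction.text import HashingVectorizer

text1 = "example article text about some news event"          # stands in for article1.content
text2 = "example article text describing the same news event"  # stands in for article2.content

vectorizer = HashingVectorizer(n_features=2**18)   # norm="l2" is the default
v1 = vectorizer.transform([text1])
v2 = vectorizer.transform([text2])

sim = (v1 * v2.T).A[0, 0]        # cosine similarity, since both vectors are L2-normalized
percent = (sim + 1) / 2 * 100    # map [-1, 1] onto [0, 100]
print(round(percent, 1))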
