How to get word count from TF*IDF value in sklearn - python

I want to get the count of a word in a given sentence using only tf*idf matrix of a set of sentences. I use TfidfVectorizer from sklearn.feature_extraction.text.
Example :
from sklearn.feature_extraction.text import TfidfVectorizer
sentences = ("The sun is shiny i like the sun","I have been exposed to sun")
vect = TfidfVectorizer(stop_words="english",lowercase=False)
tfidf_matrix = vect.fit_transform(sentences).toarray()
I want to be able to calculate the number of times the term "sun" occurs in the first sentence (which is 2) using only tfidf_matrix[0] and probably vect.idf_ .
I know there are infinite ways to get term frequency and words count but I have a special case where I only have a tfidf matrix.
I already tried to divide the tfidf value of the word "sun" in the first sentence by its idf value to get tf. Then I multiplied tf by the total number of words in the sentence to get the words count. Unfortunately, I get wrong values.

The intuitive thing to do would be exactly what you tried: multiply each tf value by the number of words in the sentence you're examining. However, I think the key observation here is that each row has been normalized by its Euclidean length. So multiplying each row by the number of words in that sentence is at best approximating the denormalized row, which is why you get weird values. AFAIK, you can't denormalize the tf*idf matrix without knowing the norms of each of the original rows ahead of time. This is primarily because there are an infinite number of vectors that can be mapped to any one normalized vector. So without the norms, you can't retrieve the correct magnitude of the original vector. See this answer for more details about what I mean.
That being said, I think there's a workaround in our case. We can at least retrieve the normalized ratios of the term counts in each sentence, i.e., sun appears twice as much as shiny. I found that normalizing each row so that the sum of the tf values is 1 and then multiplying those values by the length of the stopword-filtered sentences seems to retrieve the original word counts.
To demonstrate:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ("The sun is shiny i like the sun", "I have been exposed to sun")
vect = TfidfVectorizer(stop_words="english", lowercase=False)
mat = vect.fit_transform(sentences).toarray()
q = mat / vect.idf_                # undo the idf weighting to recover the (normalized) tf values
sums = np.ones((q.shape[0], 1))
lens = np.ones((q.shape[0], 1))
for ix in range(q.shape[0]):
    sums[ix] = np.sum(q[ix, :])
    lens[ix] = len([x for x in sentences[ix].split() if x in vect.get_feature_names()])  # have to filter out stopwords
sum_to_1 = q / sums                # rescale each row so its tf values sum to 1
tf = sum_to_1 * lens               # multiply by the stopword-filtered sentence length
print(tf)
yields:
[[ 1. 0. 1. 1. 2.]
[ 0. 1. 0. 0. 1.]]
I tried this with a few more complicated sentences and it seems to work alright. Let me know if I missed anything.

Related

Generate equidistant multidimensional vectors as Embedding matrix

I need to generate an embedding matrix to use instead of the layer. I know a priori the similarity between the 10 features (all equidistant from each other) and I can't generate the matrices through training because I don't have enough data.
To do this I have to generate 10 vectors of arbitrary but equal size (e.g. 10), all equidistant from each other, with each dimension's value between -1 and 1, all in Python.
Anyone know how this can be done?
I believe you have some words as features and you want to represent them as embedding vectors.
There are several ways to create word embeddings; I will mention a few of them, from the simplest to the more complex yet very powerful methods.
1. Count Vector.
It is a method of creating vectors out of your unique tokens. For example, if the vocabulary contains three words, say ["and", "basketball", "more"], then the text "more and more" will be mapped to the vector [1, 0, 2]: the word "and" appears once, the word "basketball" does not appear at all, and the word "more" appears twice. This text representation is called a bag of words, since it completely loses the order of the words.
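As a minimal sketch of that mapping (the fixed three-word vocabulary here is just to mirror the example):
from sklearn.feature_extraction.text import CountVectorizer
# fix the vocabulary so the columns line up with ["and", "basketball", "more"]
vectorizer = CountVectorizer(vocabulary=["and", "basketball", "more"])
print(vectorizer.fit_transform(["more and more"]).toarray())  # -> [[1 0 2]]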
2. TF-IDF (Term Frequency - Inverse Document Frequency)
The problem with the count vector is that it drowns out important words: a rare but informative term like "basketball" gets a small count compared to common words like "more", so "more" ends up dominating the representation. The TF-IDF approach overcomes this. For example, imagine that the words "and", "basketball", and "more" appear respectively in 200, 10, and 100 text instances in the training set: in this case, the final vector will be [1/log(200), 0/log(10), 2/log(100)], which is approximately equal to [0.19, 0., 0.43].
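A quick numpy check of that arithmetic (this reproduces the count/log(document frequency) weighting used in the example above, which is not exactly the formula TfidfVectorizer uses):
import numpy as np
counts = np.array([1, 0, 2])         # counts of "and", "basketball", "more" in "more and more"
doc_freq = np.array([200, 10, 100])  # number of training texts each word appears in
print(counts / np.log(doc_freq))     # -> approximately [0.19  0.    0.43]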
3. Pre-trained word vectors.
These are embedding vectors trained on large amounts of text from Wikipedia or other general sources, so they cover most common English terms. Many open-source pre-trained word vectors are available; some of them are:
GoogleNews vector.
GloVe
fastText by Facebook.
You can choose the vector dimension based on what the model provides; for example, you can use 50-, 100-, 200-, or 300-dimensional vectors for each word.
from gensim.models import KeyedVectors
# loading the downloaded model (the old Word2Vec.load_word2vec_format API has been
# replaced by KeyedVectors.load_word2vec_format in recent gensim versions)
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# the model is loaded; it can be used to perform all of the tasks mentioned above
# getting the word vector of a word
dog = model['dog']
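As a quick usage sketch (assuming the GoogleNews file above has already been downloaded), the same object can also answer similarity queries:
# words most similar to "dog", returned as (word, cosine similarity) pairs
print(model.most_similar('dog', topn=5))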
For more details and other methods to create word embeddings, you can refer to this beautiful article written by NCC.
Hope this answers your question, Happy Learning!

Naive Bayes, Text Analysis, SKLearn

This is from a text analysis exercise using data from Rotten Tomatoes. The data is in critics.csv, imported as a pandas DataFrame, "critics".
This piece of the exercise is to
Construct the cumulative distribution of document frequencies (df).
The x-axis is a document count (x_i) and the y-axis is the percentage of words that appear less than (x_i) times. For example, at x = 5, plot a point representing the percentage or number of words that appear in 5 or fewer documents.
From a previous exercise, I have a "Bag of Words"
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
# build the vocabulary and transform to a "bag of words"
X = vectorizer.fit_transform(critics.quote)
# Convert matrix to Compressed Sparse Column (CSC) format
X = X.tocsc()
Every example I've found calculates a matrix of documents per word from that "bag of words" matrix in this way:
docs_per_word = X.sum(axis=0)
I buy that this works; I've looked at the result.
But I'm confused about what's actually happening and why it works, what is being summed, and how I might have been able to figure out how to do this without needing to look up what other people did.
I figured this out last night. It doesn't actually work; I misinterpreted the result. (I thought it was working because Jupyter notebooks only show a few values in a large array. But, examined more closely, the array values were too big. The max value in the array was larger than the number of 'documents'!)
X (my "bag of words") is a word-frequency matrix. Summing over X tells you how often each word occurs within the corpus of documents. But the instructions ask for how many documents a word appears in (e.g. between 0 and 4 for four documents), not how many times it appears in the set of those documents (0 - n for four documents).
I need to convert X to a boolean matrix. (Now I just have to figure out how to do this. ;-)
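For the record, a minimal sketch of that conversion, assuming X is the CSC matrix built above: comparing against zero binarizes the matrix, and summing over the document axis then counts documents rather than occurrences.
# number of documents each word appears in (binarize, then sum over documents)
docs_per_word = (X > 0).sum(axis=0)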

how to calculate term-document matrix?

I know that a term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
I am using sklearn's CountVectorizer to extract features from strings (text files) to ease my task. The following code returns a term-document matrix according to the sklearn documentation:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
content = ["how to format my hard disk", "hard disk format problems"]
X = vectorizer.fit_transform(content) #X is Term-document matrix
print(X)
The output is as follows. I am not getting how this matrix has been calculated; please explain using the example shown in the code. I have read one more example on Wikipedia but could not understand it.
The output of CountVectorizer().fit_transform() is a sparse matrix, which means that only the non-zero elements of the matrix are stored. When you do print(X), only the non-zero entries are displayed, as you observe in the image.
As for how the calculation is done, you can have a look at the official documentation here.
The CountVectorizer, in its default configuration, tokenizes the given documents or raw text (it only keeps terms that have 2 or more characters) and counts the word occurrences.
Basically, the steps are as follows:
Step 1 - Collect all distinct terms from all the documents passed to fit().
For your data, they are
[u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']
This is available from vectorizer.get_feature_names()
Step 2 - In transform(), count how often each of the terms collected during fit() occurs in each document, and output the counts in the term-frequency matrix.
In your case, you are supplying both documents to transform() (fit_transform() is a shorthand for fit() and then transform()). So, the result is
        disk  format  hard  how  my  problems  to
First     1      1      1    1    1      0      1
Second    0      1      1    0    0      1      0
You can get the above result by calling X.toarray().
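For instance, continuing from the question's code, something like this prints the tokens and the dense counts side by side:
print(vectorizer.get_feature_names())  # ['disk', 'format', 'hard', 'how', 'my', 'problems', 'to']
print(X.toarray())
# [[1 1 1 1 1 0 1]
#  [0 1 1 0 0 1 0]]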
In the image of the print(X) you posted, the first column represents the (row, column) index into the term-frequency matrix and the second represents the frequency of that term.
<0,0> means first row, first column, i.e. the frequency of the term "disk" (first term in our tokens) in the first document = 1
<0,2> means first row, third column, i.e. the frequency of the term "hard" (third term in our tokens) in the first document = 1
<0,5> means first row, sixth column, i.e. the frequency of the term "problems" (sixth term in our tokens) in the first document = 0. But since it is 0, it is not displayed in your image.

Naive Bayes text classification incorrect results

I've coded up a Naive Bayes classifier, but it doesn't seem to be working particularly well. Counting the words etc. is not a problem, but the probabilities have been.
The method I've been using starts at page 180 in this book
But I'll use the terms from the wiki article to make it more universal.
Training
With training I'm creating a probability for every word occurring in a category:
for category in categories:
    for word in category_vocabulary[category]:
        word_probability[category][word] = (category_vocabulary[category][word] + 1) / (total_words_in_category[category] + len(vocabulary))
So I get the total number of times a word occurs in a category, add one, and then divide that by the total number of words in the category plus the size of the vocabulary (distinct words). This is P(x_i|C_k).
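As a quick numeric sanity check of that estimate (the counts below are made up purely for illustration):
# suppose "ball" occurs 3 times in the "sports" category, the category contains
# 100 words in total, and the vocabulary has 50 distinct words
p_ball_given_sports = (3 + 1) / (100.0 + 50)  # = 0.0267, the Laplace-smoothed P("ball"|"sports")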
I also calculate the probability of a category, p(C_k), as category_probability, which is simply the number of words in a category divided by the number of words in all categories:
for category in categories:
    category_probability[category] = total_words_in_category[category] / sum(total_words_in_category.values())
Classifying
For classification I loop through all the tokens of the document to be classified, and calculate the product of word_probability for all the words in the text.
for category in categories:
    for word in document_words:  # tokens of the document being classified
        if word in word_probability[category]:
            if final_probability[category] == 0:
                final_probability[category] = word_probability[category][word]
            else:
                final_probability[category] *= word_probability[category][word]
Lastly, to calculate a score, I multiply this by the category probability:
score = category_probability[category] * final_probability[category]
This score seems to be completely wrong and I'm not sure what to do. When I've looked up other people's methods they seem to involve a few logs and exponents, but I'm not sure how they fit in with the book and wiki article.
Any help would be much appreciated as I imagine what I'm doing wrong is somewhat obvious to somebody that better understands it.
This score seems to be completely wrong and I'm not sure what to do.
First of all, category probability is not estimated by the fraction of words in a category vs. total number of words
for category in categories:
    category_probability[category] = total_words_in_category[category] / sum(total_words_in_category.values())
but by the number of sentences in a category vs. the total number of sentences (or paragraphs, documents, objects - whatever it is you are classifying). Thus
for category in categories:
    category_probability[category] = total_objects_in_category[category] / sum(total_objects_in_category.values())
When I've looked up other peoples methods they seem to involve a few logs and exponents but I'm not sure how they fit in with the book and wiki article.
This is because direct probability computation (which you do) is numerically unstable. You will end up multiplying lots of tiny numbers, so precision falls off exponentially. Consequently one uses this simple mathematical equality:
PROD_i P_i(x) = exp[ log( PROD_i P_i(x) ) ] = exp[ SUM_i log P_i(x) ]
Thus instead of storing probabilities you store logarithms of probabilities, and instead of multiplying them you sum them. If you want to recover the true probability, all you have to do is take the exponential, but for classification you do not need to, since P(x) > P(y) <=> log P(x) > log P(y).
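A minimal sketch of what that looks like with the question's dictionaries (document_words, a list of the document's tokens, is an assumed name):
import math

log_score = {}
for category in categories:
    # start from the log of the category prior instead of multiplying it in at the end
    log_score[category] = math.log(category_probability[category])
    for word in document_words:
        if word in word_probability[category]:
            # add log-probabilities instead of multiplying raw probabilities
            log_score[category] += math.log(word_probability[category][word])
# the category with the largest log-score is the prediction
prediction = max(log_score, key=log_score.get)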

TFIDF calculating confusion

I found the following code on the internet for calculating TFIDF:
https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py
I added "1+" in the function def idf(word, documentList) so I won't get a division-by-zero error:
return math.log(len(documentList) / (1 + float(numDocsContaining(word,documentList))))
But I am confused about two things:
I get negative values in some cases, is this correct?
I am confused with line 62, 63 and 64.
Code:
documentNumber = 0
for word in documentList[documentNumber].split(None):
    words[word] = tfidf(word, documentList[documentNumber], documentList)
Should TFIDF be calculated on the first document only?
No. Tf-idf is tf, a non-negative value, times idf, a non-negative value, so it can never be negative. This code seems to be implementing the erroneous definition of tf-idf that was on Wikipedia for years (it has been fixed in the meantime).
If the word in question is contained in every document in the collection, your 1+ change will result in a negative value, since 0 < x / (1 + x) < 1 holds for all x > 0, which yields a negative logarithm.
In my opinion the correct IDF for a word that appears in no document is infinite or undefined, but by adding 1+ to both the denominator and the numerator, a nonexistent word will have an IDF slightly higher than any existing word, and words that appear in every document will have an IDF of zero. Both cases will probably work well with your code.
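A minimal sketch of that suggestion, reusing the numDocsContaining helper from the linked tf-idf.py (the name smoothed_idf is mine):
import math

def smoothed_idf(word, documentList):
    # add 1 to both numerator and denominator: a word in every document gets an IDF of 0,
    # and an unseen word gets an IDF slightly higher than any seen word
    df = float(numDocsContaining(word, documentList))
    return math.log((1 + len(documentList)) / (1 + df))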
