TFIDF calculating confusion - python

I found the following code on the internet for calculating TFIDF:
https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py
I added "1+" in the function def idf(word, documentList) so i won't get divided by 0 error:
return math.log(len(documentList) / (1 + float(numDocsContaining(word,documentList))))
But I am confused about two things:
I get negative values in some cases; is this correct?
I am confused by lines 62, 63 and 64.
Code:
documentNumber = 0
for word in documentList[documentNumber].split(None):
    words[word] = tfidf(word, documentList[documentNumber], documentList)
Should TFIDF be calculated on the first document only?

No. Tf-idf is tf, a non-negative value, times idf, a non-negative value, so it can never be negative. This code seems to be implementing the erroneous definition of tf-idf that was on Wikipedia for years (it has since been fixed).

If the word in question is contained in every document in the collection, your 1+ change will result in a negative value, since 0 < x / (1 + x) < 1 holds for all x > 0, which gives a negative logarithm.
In my opinion the correct IDF for a nonexistent word is infinite or undefined, but by adding 1 to both the numerator and the denominator a nonexistent word will have an IDF slightly higher than any existing word, and words that exist in every document will have an IDF of zero. Both cases will probably work well with your code.
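A minimal sketch of that smoothed variant (using a simplified stand-in for the script's numDocsContaining helper):

import math

def num_docs_containing(word, document_list):
    # count the documents that contain the word at least once
    return sum(1 for doc in document_list if word in doc.split())

def idf(word, document_list):
    # smoothed IDF: adding 1 to both numerator and denominator keeps the
    # ratio >= 1, so the logarithm is never negative
    df = float(num_docs_containing(word, document_list))
    return math.log((1 + len(document_list)) / (1 + df))

docs = ["the cat sat", "the dog sat", "the cat ran"]
print(idf("the", docs))    # appears in every document -> 0.0
print(idf("zebra", docs))  # appears in none -> the highest IDF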


Unknown pattern detection in a list of numbers

I have a sequence of numbers that follow some kind of arbitrary rule, let's imagine the following 5 examples:
A = [1,2,3,4]
B = [8,7,6,5,4,3,2]
C = [2,4,6,8,10,12]
D = [15,18,21,24]
E = [2,8,18,32,50]
Sequence A follows the rule x_n = x_{n-1} + 1 with x_0 = 1, sequence B follows x_n = x_{n-1} - 2 with x_0 = 8, and so on. Example E follows the more complex formula x_i = 2(i+1)^2 (for i starting at 0).
How, using python, can I predict the next element of each sequence?
You can fit a curve using scipy.optimize.curve_fit if you have a specific function in mind, or you could use numpy.polyfit if you're confident that the "and so on" is always going to conform to some polynomial - most of your examples are linear, which is just a polynomial of degree 1 (example E is quadratic, degree 2).
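For the curve_fit route, here's a quick sketch, assuming you already suspect the quadratic form behind sequence E:

import numpy as np
from scipy.optimize import curve_fit

def quadratic(x, a, b, c):
    return a * x**2 + b * x + c

xs = np.arange(5)                  # positions 0..4
ys = np.array([2, 8, 18, 32, 50])  # sequence E
params, _ = curve_fit(quadratic, xs, ys)
print(quadratic(5, *params))       # predicts the next element, ~72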
Here's an example of using numpy.polyfit:
import numpy as np
model = np.polyfit([0,1], [1,2],deg=1)
This will take in your values [1,2] and map them to positional values [0,1] before calculating the 1-degree polynomial that best fits their sequence.
You then need a function to use the model to predict the n'th value in the sequence (Alternatively, use poly1d) but here's a simple polynomial calculator function that accepts coefficients as the first parameter, and a value of x for which you want to return the result of the polynomial:
def poly(coeffs, x):
    accumulator = 0
    n = len(coeffs) - 1
    for e, i in enumerate(coeffs):
        accumulator = accumulator + (i * (x ** (n - e)))
    return accumulator
So, we've trained it on a sequence with indices 0,1 - the answer for the 3rd point, with index 2 is found by:
poly(model,2)
Which returns the expected value of 3.
Here's an example using the sequence [3,6,9,12]:
model = np.polyfit([0,1,2,3], [3,6,9,12],deg=1)
poly(model,4)
Gives the answer 15. (OK, 15.000000000000002, but it's close enough - if you're confident that you're always going to arrive at integer answers then you could round to the closest integer - or choose some level of precision you're comfortable with)
This is all linear; for a quadratic model, you'd change deg=1 to deg=2, and so on.
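As a quick sketch of the quadratic case, fitting sequence E with deg=2 recovers its 2(i+1)^2 pattern and predicts the next element:

model_e = np.polyfit([0, 1, 2, 3, 4], [2, 8, 18, 32, 50], deg=2)
print(poly(model_e, 5))       # ~72, i.e. 2*(5+1)**2
print(np.poly1d(model_e)(5))  # same result via numpy's poly1d helper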
What this won't do for you is find more interesting patterns for which there isn't a polynomial to describe them. The On-Line Encyclopedia of Integer Sequences has a huge list of such sequences, but examples might include the Fibonacci sequence, the prime number sequence, or triangular numbers; for these more interesting examples, you'll need to come up with a more nuanced approach.

How to choose the proper maximum value for Damerau-Levenshtein distance?

I am using the Damerau-Levenshtein code available from here in my similarity measurements. The problem is that when I apply Damerau-Levenshtein to two strings such as "cat sat on a mat" and "dog sat mat", I get an edit distance of 8. The distance can be any number of insertions, deletions or substitutions (0, 1, 2, ...). Now I am wondering whether there is a way to assume or find a maximum for this distance so it can be converted to a value between 0 and 1, or how I can set the max value so that at least I can say: distance = 1 - similarity.
The reason for this post is that I am setting a threshold for a few distance metrics like cosine, Levenshtein and Damerau-Levenshtein, and the outputs of all of them should be between zero and 1.
Levenshtein distance score = number of insertions + number of deletions + number of substitutions, so for any pair of strings it can never exceed the length of the longer string; the length of the longest string in your data set is therefore a safe maximum.
The difficult thing is that the upper bound of Damerau-Levenshtein is unbounded in general (it grows with the length of the words), but you can't practically make infinite strings.
If you wanted to be safe, you could use something that maps the range 0 -> max possible length of a string onto the range 0 -> 1. The max possible length of a string depends on the amount of memory you have (assuming 64 bit), so I'd recommend doing...not this.
Practically, you can also just check all of the strings you are about to compare and choose the length of the longest string in that list as the max value. Another solution is to compute all of the scores beforehand and apply the conversion factor after you know the max score. Some code that could do that:
def adjustScore(lists, maxNum):
    scaleFactor = 1.0 / maxNum
    return [x * scaleFactor for x in lists]

testWords = ["test1", "testing2", "you", "must", "construct", "additional", "plyometrics"]
testScores = []
for i in range(len(testWords) - 1):
    # damerau_levenshtein_distance is assumed to come from the code linked in the question
    testScores.append(damerau_levenshtein_distance(testWords[i], testWords[i+1]))

#method 1: just check the biggest score you got to obtain the max
max1 = max(testScores)
result = adjustScore(testScores, max1)

#method 2: if you need the adjusted score first, pick the longest string's length as max
lens = map(len, testWords)
max2 = max(lens)
result2 = adjustScore(testScores, max2)
These happen to give identical answers because most of the words are very different from each other but either of these approaches should work for most cases.
Long story short, the max distance between two strings is the length of the longer string.
Notes: if this maps in the wrong direction (i.e. high scores are showing low and vice versa), just add "1 -" before the x * scaleFactor in adjustScore.
Also, if you want it to map to a different range, replace the 1 with a different max value.
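Putting that together, a minimal sketch of a 0-to-1 similarity based on the longer string's length (my own helper, still assuming the damerau_levenshtein_distance function from the code you linked):

def normalized_similarity(a, b):
    # the distance between two strings can never exceed the longer string's length,
    # so dividing by it maps the distance into [0, 1]; 1.0 means identical strings
    max_len = max(len(a), len(b), 1)  # guard against two empty strings
    return 1.0 - damerau_levenshtein_distance(a, b) / float(max_len)

# e.g. normalized_similarity("cat sat on a mat", "dog sat mat") -> 1 - 8/16 = 0.5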

Complication using log-probabilities - Naive Bayes text classifier

I'm constructing a Naive Bayes text classifier from scratch in Python and I am aware that, upon encountering a product of very small probabilities, using a logarithm over the probabilities is a good choice.
The issue now, is that the mathematical function that I'm using has a summation OVER a product of these extremely small probabilities.
To be specific, I'm trying to calculate the total word probabilities given a mixture component (class) over all classes.
Just plainly adding up the logs of these total probabilities is incorrect, since the log of a sum is not equal to the sum of logs.
To give an example, let's say that I have 3 classes, 2000 words and 50 documents.
Then I have a word probability matrix called wordprob with 2000 rows and 3 columns.
The algorithm for the total word probability in this example would look like this:
sum = 0
for j in range(0, 3):
    prob_product = 1
    for i in words:  # just the index of words from my vocabulary in this document
        prob_product = prob_product * wordprob[i, j]
    sum = sum + prob_product
What ends up happening is that prob_product becomes 0 on many iterations due to many small probabilities multiplying with each other.
Since I can't easily solve this with logs (because of the summation in front) I'm totally clueless.
Any help will be much appreciated.
I think you may be best off keeping everything in logs. The first part of this, computing the log of the product, is just adding up the logs of the terms. The second bit, computing the log of the sum of the exponentials of the logs, is a bit trickier.
One way would be to store each of the logs of the products in an array, and then you need a function that, given an array L with n elements, will compute
S = log( sum { i=1..n | exp( L[i])})
One way to do this is to find the maximum, M say, of the L's; a little algebra shows
S = M + log( sum { i=1..n | exp( L[i]-M)})
Each of the terms L[i]-M is non-positive, so overflow can't occur. Underflow is not a problem, as for those terms exp will simply return 0. At least one of them (the one where L[i] is M) will be zero, so its exp will be one and we'll end up with something we can pass to log. In other words the evaluation of the formula will be trouble free.
If you have the function log1p (log1p(x) = log(1+x)) then you could gain some accuracy by omitting the (just one!) i where L[i] == M from the sum, and passing the sum to log1p instead of log.
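A minimal sketch of that idea in Python (a hand-rolled log-sum-exp; scipy.special.logsumexp does the same job if scipy is available to you):

import math

def log_sum_exp(logs):
    # computes log(sum(exp(L[i]))) without overflow by factoring out the maximum
    m = max(logs)
    return m + math.log(sum(math.exp(x - m) for x in logs))

# each entry is the sum of log(wordprob[i, j]) over the words in the document,
# i.e. the log of one class's product; the values here are made up for illustration
log_products = [-700.0, -705.5, -710.2]
print(log_sum_exp(log_products))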
Your question seems to be on the math side of things rather than the coding of it.
I haven't quite figured out what your issue is, but the sum of logs equals the log of the product. Don't know if that helps.
Also, watch the indentation: if sum = sum + prob_product sits outside the j-loop, you'd only be using the last prob_product (which you keep re-initializing), so make sure you accumulate it inside the j-loop or use it before you move to the next j. Finally, it doesn't look like you need to initialize sum unless this is part of yet another loop you are not showing here.
That's all I have for now.
Sorry for the long post and no code.
High school algebra tells you this:
log(A*B*...*Z) = log(A) + log(B) + ... + log(Z) != log(A + B + ... + Z)
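To see it numerically, a throwaway check:

import math
a, b = 2.0, 3.0
print(math.log(a * b) - (math.log(a) + math.log(b)))  # ~0.0: log of a product is the sum of logs
print(math.log(a * b) - math.log(a + b))              # clearly non-zero: not the log of the sum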

How to get word count from TF*IDF value in sklearn

I want to get the count of a word in a given sentence using only the tf*idf matrix of a set of sentences. I use TfidfVectorizer from sklearn.feature_extraction.text.
Example :
from sklearn.feature_extraction.text import TfidfVectorizer
sentences = ("The sun is shiny i like the sun","I have been exposed to sun")
vect = TfidfVectorizer(stop_words="english",lowercase=False)
tfidf_matrix = vect.fit_transform(sentences).toarray()
I want to be able to calculate the number of times the term "sun" occurs in the first sentence (which is 2) using only tfidf_matrix[0] and probably vect.idf_ .
I know there are countless ways to get term frequencies and word counts, but I have a special case where I only have a tfidf matrix.
I already tried to divide the tfidf value of the word "sun" in the first sentence by its idf value to get tf. Then I multiplied tf by the total number of words in the sentence to get the word count. Unfortunately, I get wrong values.
The intuitive thing to do would be exactly what you tried: multiply each tf value by the number of words in the sentence you're examining. However, I think the key observation here is that each row has been normalized by its Euclidean length. So multiplying each row by the number of words in that sentence is at best approximating the denormalized row, which is why you get weird values. AFAIK, you can't denormalize the tf*idf matrix without knowing the norms of each of the original rows ahead of time. This is primarily because there are an infinite number of vectors that can be mapped to any one normalized vector, so without the norms you can't retrieve the correct magnitude of the original vector. See this answer for more details about what I mean.
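To see the normalization concretely (a quick check, relying on TfidfVectorizer's default norm='l2'), every row of the matrix comes out with unit Euclidean length:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ("The sun is shiny i like the sun", "I have been exposed to sun")
vect = TfidfVectorizer(stop_words="english", lowercase=False)
mat = vect.fit_transform(sentences).toarray()
print(np.linalg.norm(mat, axis=1))  # [1. 1.] -- each row has been scaled to unit length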
That being said, I think there's a workaround in our case. We can at least retrieve the normalized ratios of the term counts in each sentence, i.e., sun appears twice as much as shiny. I found that normalizing each row so that the sum of the tf values is 1 and then multiplying those values by the length of the stopword-filtered sentences seems to retrieve the original word counts.
To demonstrate:
sentences = ("The sun is shiny i like the sun","I have been exposed to sun")
vect = TfidfVectorizer(stop_words="english",lowercase=False)
mat = vect.fit_transform(sentences).toarray()
q = mat / vect.idf_
sums = np.ones((q.shape[0], 1))
lens = np.ones((q.shape[0], 1))
for ix in xrange(q.shape[0]):
sums[ix] = np.sum(q[ix,:])
lens[ix] = len([x for x in sentences[ix].split() if unicode(x) in vect.get_feature_names()]) #have to filter out stopwords
sum_to_1 = q / sums
tf = sum_to_1 * lens
print tf
yields:
[[ 1. 0. 1. 1. 2.]
[ 0. 1. 0. 0. 1.]]
I tried this with a few more complicated sentences and it seems to work alright. Let me know if I missed anything.

Implementation of Naive Bayes - accuracy issues

EDIT: The correct version of the code which works can be found at :
https://github.com/a7x/NaiveBayes-Classifier
I used data from openClassroom and started working on a small version of Naive Bayes in Python. The steps were the usual training and then prediction. I have a few questions and want to know why the accuracy is quite bad.
For training, I calculated the log likelihood by the formula:
log( (count(word in spam) + 1) / (spamSize + vocabSize) )
My question is: why did we add vocabSize to the denominator in this case, and is this the correct way of going about it? The code used is below:
#This is for training. Calculate all probabilities and store them in a vector. Better to store it in a file for easier access
from __future__ import division
import sys,os
'''
1. The spam and non-spam is already 50% . So they by default are 0.5
2. Now we need to calculate probability of each word , in spam and non-spam separately
  2.1 we can make two dictionaries, defaultdicts basically, for spam and non-spam
  2.2 When time comes to calculate probabilities, we just need to substitute values
'''
from collections import *
from math import *

spamDict = defaultdict(int)
nonspamDict = defaultdict(int)
spamFolders = ["spam-train"]
nonspamFolders = ["nonspam-train"]
path = sys.argv[1]                    #Base path
spamVector = open(sys.argv[2],'w')    #Write all spam values into this
nonspamVector = open(sys.argv[3],'w') #Non-spam values

#Go through all files in spam and iteratively add values
spamSize = 0
nonspamSize = 0
vocabSize = 264821
for f in os.listdir(os.path.join(path,spamFolders[0])):
    data = open(os.path.join(path,spamFolders[0],f),'r')
    for line in data:
        words = line.split(" ")
        spamSize = spamSize + len(words)
        for w in words:
            spamDict[w] += 1
for f in os.listdir(os.path.join(path,nonspamFolders[0])):
    data = open(os.path.join(path,nonspamFolders[0],f),'r')
    for line in data:
        words = line.split(" ")
        nonspamSize = nonspamSize + len(words)
        for w in words:
            nonspamDict[w] += 1

logProbspam = {}
logProbnonSpam = {} #This is to store the log probabilities
for k in spamDict.keys():
    #Need to calculate P(x | y = 1)
    numerator = spamDict[k] + 1 # Frequency
    print 'Word',k,' frequency',spamDict[k]
    denominator = spamSize + vocabSize
    p = log(numerator/denominator)
    logProbspam[k] = p
for k in nonspamDict.keys():
    numerator = nonspamDict[k] + 1 #frequency
    denominator = nonspamSize + vocabSize
    p = log(numerator/denominator)
    logProbnonSpam[k] = p

for k in logProbnonSpam.keys():
    nonspamVector.write(k+" "+str(logProbnonSpam[k])+"\n")
for k in logProbspam.keys():
    spamVector.write(k+" "+str(logProbspam[k])+"\n")
For prediction, I just took a mail, split it into words, added up the probabilities separately for spam/non-spam, and multiplied each by 0.5. Whichever was higher was the class label. The code is below:
http://pastebin.com/8Y6Gm2my ( Stackoverflow was again playing games for some reason :-/)
EDIT: I have removed the spam = spam + 1 thing. Instead, I just ignore those words.
Problem: my accuracy is quite bad, as noted below.
No of files in spam is 130
No. of spam in ../NaiveBayes/spam-test is 53 no. of non-spam 77
No of files in non-spam is 130
No. of spam in ../NaiveBayes/nonspam-test/ is 6 no. of non-spam 124
Please tell me where all I am going wrong. I think an accuracy of less than 50% means there must be some glaring error(s) in the implementation.
There are multiple errors and bad assumptions in your program - in both parts of it. Here are several.
1. You hardcode into your program the fact that you have the same number of spam and non-spam emails. I'd recommend not hardcoding this assumption. This is not absolutely essential, but in a more general case you will need to remove it.
2. You hardcode into your program a number that you treat as the vocabulary size. I'd not recommend doing this, as this number may change on any modification of the training set. Furthermore, it's actually incorrect. I'd recommend calculating it during learning.
3. This may not be a mistake, but you seem to have a vocabulary of all the words in the training set. This may be suboptimal; actually the page you refer to recommends taking into account only the top 2500 words over all the emails. However, that's not essential for obtaining correct results - even without this filtering, my implementation gets only a handful of emails misclassified.
4. You incorrectly account for words that have been observed in spam or non-spam only. The log-probability for them being found in the other subset is not the 1 which you add, but log(1/(spamSize+vocabSize)) or log(1/(nonspamSize+vocabSize)), depending on the group. This is actually very important - you need to store this probability with your data for the program to function correctly.
5. You do not ignore words never observed in the training set. Actually these may be treated in different ways, but you should take them into account.
6. Due to incorrect indentation in the prediction function, you predict using not the whole message but only the first line of the message. Just a programming bug.
Update. You have fixed 6. Also, 1 is not strictly necessary to fix while you're working with this dataset, and 3 is not required either.
Your modification did not correctly fix either 4 or 5. First, if a word has never been observed in some set, the probability of the message under that class should decrease. Ignoring the word is not a good idea; you need to account for it as a highly improbable one (see the sketch at the end of this answer).
Second, your current code is asymmetric: a word being absent in spam cancels the check for non-spam (but not the other way around). If you need to do nothing in an exception handler, use pass, not continue, as the latter immediately goes to the next for w in words: iteration.
Problem 2 is also still in place - the vocabulary size you use does not match the real one. It must be the number of distinct words observed in the training set, not the total number of words in all the messages put together.
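A minimal sketch of how points 4 and 5 could be handled at prediction time (my own helper function, not your code; the fallback log-probability for a word unseen in a class reuses the same Laplace-smoothed denominator as in training):

import math

def class_log_score(words, log_prob, class_size, vocab_size, log_prior):
    # log P(class) + sum of log P(word | class); a word never seen in this class
    # falls back to the smoothed zero-count value log(1 / (class_size + vocab_size))
    unseen = math.log(1.0 / (class_size + vocab_size))
    return log_prior + sum(log_prob.get(w, unseen) for w in words)

# pick whichever class scores higher, e.g.:
# spamScore = class_log_score(words, logProbspam, spamSize, vocabSize, math.log(0.5))
# nonspamScore = class_log_score(words, logProbnonSpam, nonspamSize, vocabSize, math.log(0.5))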
Here's at least one of the errors you're making: you're storing log-probabilities in your model file (as you should), but then in the prediction code you're pretending that they are straight probabilities:
totalSpamP = spamP * 0.5
should be
totalSpamP = spamP + math.log(0.5)
Also, I don't get what this line is doing:
spamP = spamP + 1
It seems to be making up for a feature not found in the spam part of the training set, but those words should not be handled that way. Right now it's adding 1 to a log-probability, which multiplies the underlying probability by e (exp(1)) - by definition invalid for a probability.
(As an aside, I just tried classification on this training set with my own implementation of Naive Bayes and got 97.6% accuracy, so that's the figure you should be aiming at :)
