Naive Bayes text classification incorrect results - python

I've coded up a Naive Bayes Classifier, but it doesn't seem to be working particularly well. Counting the words etc. is not a problem but the probabilities have been.
The method I've been using starts at page 180 in this book
But I'll use the terms from the wiki article to make it more universal.
Training
With training I'm creating a probability for every word occurring in a category:
for category in categories:
    for word in category_vocabulary[category]:
        word_probability[category][word] = (category_vocabulary[category][word] + 1) / (total_words_in_category[category] + len(vocabulary))
So I take the number of times a word occurs in a category, add one, and divide that by the total number of words in the category plus the size of the vocabulary (the number of distinct words). This is P(xi|Ck).
I also calculate the probability of a category p(Ck), category_probability, which is simply the number of words in a category divided by the number of words in all categories:
for category in categories:
    category_probability[category] = total_words_in_category[category] / sum(total_words_in_category.values())
Classifying
For classification I loop through all the tokens of the document to be classified, and calculate the product of word_probability for all the words in the text.
for category in categories:
    if word in word_probability[category]:
        if final_probability[category] == 0:
            final_probability[category] = word_probability[category][word]
        else:
            final_probability[category] *= word_probability[category][word]
Lastly to calculate a score I multiply this by the category probability
score = category_probability[category] * final_probability[category]
This score seems to be completely wrong and I'm not sure what to do. When I've looked up other people's methods they seem to involve a few logs and exponents, but I'm not sure how they fit in with the book and wiki article.
Any help would be much appreciated as I imagine what I'm doing wrong is somewhat obvious to somebody that better understands it.

This score seems to be completely wrong and I'm not sure what to do.
First of all, category probability is not estimated by the fraction of words in a category vs. the total number of words
for category in categories:
    category_probability[category] = total_words_in_category[category] / sum(total_words_in_category.values())
but by the number of sentences in a category vs. the total number of sentences (or paragraphs, documents, objects: whatever the thing you are classifying is). Thus
for category in categories:
    category_probability[category] = total_objects_in_category[category] / sum(total_objects_in_category.values())
When I've looked up other people's methods they seem to involve a few logs and exponents, but I'm not sure how they fit in with the book and wiki article.
This is because direct probability computation (which is what you do) is numerically unstable. You end up multiplying lots of tiny numbers, so the product shrinks exponentially and eventually underflows. Consequently one uses this simple mathematical identity:
PROD_i P_i(x) = exp [ log [ PROD_i P_i(x) ] ] = exp [ SUM_i log P_i(x) ]
Thus instead of storing probabilities you store logarithms of probabilities, and instead of multiplying them, you sum them. If you want to recover the true probability, all you have to do is take exp of the result, but for classification you do not have to, as P(x) > P(y) <-> log P(x) > log P(y).

Related

cosine similarity doc vectors and word vectors for topical prevalence using doc2vec

I have a corpus of 250k Dutch news articles 2010-2020 to which I've applied word2vec models to uncover relationships between sets of neutral words and dimensions (e.g. good-bad). Since my aim is also to analyze the prevalence of certain topics over time, I was thinking of using doc2vec instead so as to simultaneously learn word and document embeddings. The 'prevalence' of topics in a document could then be calculated as the cosine similarities between doc vectors and word embeddings (or combinations of word vectors). In this way, I can calculate the annual topical prevalence in the corpus and see whether there are any changes over time. An example of such an approach can be found here.
My issue is that the avg. yearly cosine similarities yield really strange results. As an example, the cosine similarities between document vectors and a mixture of keywords related to covid-19/coronavirus show a decrease in topical prevalence since 2016 (which obviously cannot be the case).
My question is whether the approach that I'm following is actually valid, or whether there's something that I'm missing. Surely 250k documents and a vocabulary of 100k+ words should be sufficient?
Below is the code that I've written:
# Doc2Vec model
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
docs = [TaggedDocument(doc, [i]) for i, doc in enumerate(tokenized_docs)]
d2vmodel = Doc2Vec(docs, min_count = 5, vector_size = 200, window = 10, dm = 1)
docvecs = d2vmodel.docvecs
wordvecs = d2vmodel.wv
# normalize vector
from numpy.linalg import norm
def nrm(x):
    return x/norm(x)
# topical prevalence per doc
def topicalprevalence(topic, docvecs, wordvecs):
    proj_lst = []
    for i in range(0, len(docvecs)):
        topic_lst = []
        for j in topic:
            # cosine similarity = dot product of the two unit-length vectors
            cossim = nrm(docvecs[i]) @ nrm(wordvecs[j])
            topic_lst.append(cossim)
        topic_avg = sum(topic_lst) / len(topic_lst)
        proj_lst.append(topic_avg)
    # df is the DataFrame holding the articles, in the same order as tokenized_docs
    topicsyrs = {
        'topic': proj_lst,
        'year': df['datetime'].dt.year
    }
    return pd.DataFrame(topicsyrs)
# avg topic prevalence per year
def avgtopicyear(topic, docvecs, wordvecs):
    docs = topicalprevalence(topic, docvecs, wordvecs)
    return pd.DataFrame(docs.groupby("year")["topic"].mean())
# run
covid = ['corona', 'coronapandemie', 'coronacrisis', 'covid', 'pandemie']
covid_scores = topicalprevalence(covid, docvecs, wordvecs)
The word-vec-to-doc-vec relationships in modes that train both are interesting, but a bit hard to characterize as to what they really mean. In a sense the CBOW-like mode of dm=1 (PV-DM) mixes doc-vectors in as one equal word among the whole window, when training to predict the 'target' word. But in the skip-gram-mixed mode dm=0, dbow_words=1, there'll be window count context-word-vec-to-target-word pair cycles to every 1 doc-vec-to-target-word pair cycle, changing the relative weight.
So if you saw a big improvement in dm=0, dbow_words=1, it might also be because that made the model relatively more word-to-word trained. Varying window is another way to change that balance, or increase epochs, in plain dm=1 mode – which should also result in doc/word compatible training, though perhaps not at the same rate/balance.
Whether a single topicalprevalence() mean vector for a full year would actually be reflective of individual word occurrences for a major topic may or may not be a valid conjecture, depending on possible other changes in the training data. Something like a difference in the relative mix of other major categories in the corpus might swamp even a giant new news topic. (EG: what if in y2020 some new section or subsidiary with a different focus, like entertainment, launched? It might swamp the effects of other words, especially when compressing down to a single vector of some particular dimensionality.)
Something like a clustering of the year's articles, and identification of the closest 1 or N clusters to the target-words, with their similarities, might be more reflective even if the population of articles is changing. Or, a plot of each year's full set of articles as a histogram-of-similarities to the target-words - which might show a 'lump' of individual articles (not losing their distinctiveness to a full-year average) developing, over time, closer to the new phenomenon.
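The histogram-of-similarities idea could be sketched roughly as follows. This is a pure-Python illustration on made-up 2-D vectors: the function names, the bin count and the toy data are all hypothetical, and in practice the vectors would come from the trained model rather than being typed in by hand.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_histograms(doc_vectors, years, topic_vector, n_bins=4):
    """Per year, count documents in equal-width similarity bins over [-1, 1]."""
    hist = defaultdict(lambda: [0] * n_bins)
    for vec, year in zip(doc_vectors, years):
        sim = cosine(vec, topic_vector)
        # map [-1, 1] onto bin indices 0..n_bins-1; sim == 1 goes in the last bin
        idx = min(int((sim + 1) / 2 * n_bins), n_bins - 1)
        hist[year][idx] += 1
    return dict(hist)

# Toy data, purely illustrative
doc_vectors = [[1, 0], [0, 1], [1, 1], [1, 0.1]]
years = [2019, 2019, 2020, 2020]
topic_vector = [1, 0]
hist = similarity_histograms(doc_vectors, years, topic_vector)
```

A growing count in the highest bins from one year to the next would be the 'lump' of on-topic articles, which a single yearly mean can hide.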
Turns out that setting parameters to dm=0, dbow_words=1 allows for training documents and words in the same space, now yielding valid results.

How to get the probability percentage of a model.predict() when clustering documents

text = "Some random text string that I want to cluster"
Y = vectorizer.transform([text])
prediction = model.predict(Y)
print(prediction)
The above passes through a value which is a string, and it returns the cluster group it thinks the string belongs in (one of three).
How can I find out the percentage accuracy of its prediction? I.e. this particular text is 90% consistent with group 1; the next text might be 45% consistent with group 2, but it will still go into group 2 nonetheless. I want to be able to catch items with a low accuracy.
Not at all, usually.
Even though a few clusterers work with probabilities internally and may have a predict_proba function to get these values, such values capture a relative responsibility rather than an accuracy.
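To illustrate the difference between responsibility and accuracy, here is a small sketch of how a mixture-style clusterer assigns responsibilities, written from scratch for a hypothetical 1-D Gaussian mixture. The function name and the cluster parameters are invented for the example; scikit-learn's GaussianMixture exposes comparable values through predict_proba.

```python
import math

def responsibilities(x, means, stds, weights):
    """Posterior probability of each mixture component for a 1-D point x."""
    # log of weight * Gaussian density; the shared constant cancels out
    log_terms = [
        math.log(w) - math.log(s) - 0.5 * ((x - m) / s) ** 2
        for m, s, w in zip(means, stds, weights)
    ]
    top = max(log_terms)
    unnorm = [math.exp(t - top) for t in log_terms]  # shifted to avoid underflow
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Three hypothetical clusters
means, stds, weights = [0.0, 5.0, 10.0], [1.0, 1.0, 1.0], [1 / 3, 1 / 3, 1 / 3]

between = responsibilities(2.5, means, stds, weights)    # halfway between two clusters
outlier = responsibilities(100.0, means, stds, weights)  # far away from every cluster
```

The outlier gets a responsibility of practically 1 for the third cluster although it fits none of them, because the values are normalized to sum to 1. A high value therefore says "closest of the available clusters", not "well explained by this cluster".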

Complication using log-probabilities - Naive Bayes text classifier

I'm constructing a Naive Bayes text classifier from scratch in Python and I am aware that, upon encountering a product of very small probabilities, using a logarithm over the probabilities is a good choice.
The issue now, is that the mathematical function that I'm using has a summation OVER a product of these extremely small probabilities.
To be specific, I'm trying to calculate the total word probabilities given a mixture component (class) over all classes.
Just plainly adding up the logs of these total probabilities is incorrect, since the log of a sum is not equal to the sum of logs.
To give an example, let's say that I have 3 classes, 2000 words and 50 documents.
Then I have a word probability matrix called wordprob with 2000 rows and 3 columns.
The algorithm for the total word probability in this example would look like this:
sum = 0
for j in range(0,3):
    prob_product = 1
    for i in words: #just the index of words from my vocabulary in this document
        prob_product = prob_product*wordprob[i,j]
    sum = sum + prob_product
What ends up happening is that prob_product becomes 0 on many iterations due to many small probabilities multiplying with each other.
Since I can't easily solve this with logs (because of the summation in front) I'm totally clueless.
Any help will be much appreciated.
I think you may be best to keep everything in logs. The first part of this, to compute the log of the product is just adding up the log of the terms. The second bit, computing the log of the sum of the exponentials of the logs is a bit trickier.
One way would be to store each of the logs of the products in an array, and then you need a function that, given an array L with n elements, will compute
S = log( sum { i=1..n | exp( L[i])})
One way to do this is to find the maximum, M say, of the L's; a little algebra shows
S = M + log( sum { i=1..n | exp( L[i]-M)})
Each of the terms L[i]-M is non-positive so overflow can't occur. Underflow is not a problem either, as for very negative terms exp will simply return 0. At least one of them (the one where L[i] is M) will be zero so its exp will be one and we'll end up with something we can pass to log. In other words the evaluation of the formula will be trouble free.
If you have the function log1p (log1p(x) = log(1+x)) then you could gain some accuracy by omitting the (just one!) i where L[i] == M from the sum, and passing the sum to log1p instead of log.
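The max-shift formula above can be sketched in Python as follows. The names log_sum_exp and log_total_word_probability are made up for this example, and the word-probability matrix is assumed to be stored as logs in a nested list indexed [word][class], which is not how the question stores it.

```python
import math

def log_sum_exp(log_values):
    """Return log(sum(exp(v) for v in log_values)) without underflow."""
    m = max(log_values)
    if m == float("-inf"):
        return m  # every term is log(0)
    return m + math.log(sum(math.exp(v - m) for v in log_values))

def log_total_word_probability(word_indices, log_wordprob, n_classes):
    """Log of the sum over classes of the product of word probabilities."""
    # log of each per-class product = sum of the logs
    log_products = [
        sum(log_wordprob[i][j] for i in word_indices)
        for j in range(n_classes)
    ]
    return log_sum_exp(log_products)

# Tiny illustrative matrix: 2 words x 2 classes
log_wordprob = [
    [math.log(0.1), math.log(0.2)],
    [math.log(0.3), math.log(0.4)],
]
result = log_total_word_probability([0, 1], log_wordprob, 2)
tiny = log_sum_exp([-1000.0, -1000.0])  # exp(-1000) alone would underflow to 0
```

Here exp(result) is 0.1*0.3 + 0.2*0.4, and tiny stays finite even though neither term is representable as a plain probability.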
Your question seems to be on the math side of things rather than the coding of it.
I haven't quite figured out what your issue is, but the sum of logs equals the log of the product. Don't know if that helps.
Also, you are calculating one prob_product for every j but you are just using the last one (and you are re-initializing it). You meant to do one of two things: either initialize it before the j-loop or use it before you increment j. Finally, it doesn't look like you need to initialize sum unless this is part of yet another loop you are not showing here.
That's all I have for now.
Sorry for the long post and no code.
High school algebra tells you this:
log(A*B*....*Z) = log(A) + log(B) + ... + log(Z) != log(A + B + .... + Z)

Average number of bits required to store one letter of British English using perfect compression in python

I have an assignment which is written under:
What is the average number of bits required to store one letter of British English if perfect compression is used?
Since the entropy of an experiment can be interpreted as the minimum number of bits required to store its result, I tried writing a program that calculates the entropy contribution of each letter and adds them all together to get the entropy of the letter distribution.
This gives me 4.17 bits, but according to this link:
With a perfect compression algorithm we should only need 2 bits per character!
So how do I implement this perfect compression algorithm on this?
import math
letters=['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
sum =0
def find_perc(s):
    perc=[0.082,0.015,0.028,0.043,0.127,0.022,0.02,0.061,0.07,0.002,0.008,0.04,0.024,0.067,0.075,0.019,0.001,0.060,0.063,0.091,0.028,0.01,0.023,0.001,0.02,0.001]
    letter=['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
    pos = 0
    temp = s.upper()
    if temp in letter:
        for x in xrange(1,len(letter)):
            if temp==letter[x]:
                pos = x
    return perc[pos]
def calc_ent(s):
    P=find_perc(s)
    sum=0
    #Calculates the Entropy of the current letter
    temp = P *(math.log(1/P)/math.log(2))
    #Does the same thing just for binary entropy (i think)
    #temp = (-P*(math.log(P)/math.log(2)))-((1-P)*(math.log(1-P)/math.log(2)))
    sum=temp
    return sum
for x in xrange(0,25):
    sum=sum+calc_ent(letters[x])
print "The min bit is : %f"%sum
There is no such thing as perfect compression, since it is provably impossible to compute the number of bits if "perfect compression" is applied. See Kolmogorov Complexity.
You will not be able to implement a compressor in a few lines of code that approaches what appears to be the limits of the compressibility of English text by computer programs, around one bit per character. Humans may be able to do a little better.
The page you link to again links to this page:
Refining the Estimated Entropy of English by Shannon Game Simulation
If you read carefully, the entropy computed there is not naively computed using the probability of occurrence of each letter. Instead, it is estimated by the following procedure:
The subject was shown the previous 100 characters of text and asked to guess the next character until successful
So I think you are not wrong; only the method differs. Using only naive occurrence probabilities, you cannot compress the information that well, but if you take the context into consideration, there is much more redundant information to exploit. E.g., e has a probability of 0.127 overall, but after th_, e probably has a probability more like 0.3.
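For reference, the single-letter (context-free) entropy that the question computes can be written compactly as below. This is only a sketch using the rounded frequencies from the question (they sum to slightly more than 1); the function name is made up. It gives the first-order figure only and says nothing about the context-based estimates discussed above.

```python
import math

# Rounded letter frequencies copied from the question, A to Z
letter_freq = dict(zip(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    [0.082, 0.015, 0.028, 0.043, 0.127, 0.022, 0.02, 0.061, 0.07,
     0.002, 0.008, 0.04, 0.024, 0.067, 0.075, 0.019, 0.001, 0.060,
     0.063, 0.091, 0.028, 0.01, 0.023, 0.001, 0.02, 0.001],
))

def entropy_bits(probabilities):
    """Shannon entropy in bits: -sum(p * log2(p)) over non-zero p."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

first_order = entropy_bits(letter_freq.values())
print("First-order entropy: %.2f bits per letter" % first_order)
```

This treats every letter as independent of its neighbours, which is why it lands well above the context-aware estimates.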

Implementation of Naive Bayes - accuracy issues

EDIT: The correct version of the code which works can be found at :
https://github.com/a7x/NaiveBayes-Classifier
I used data from openClassroom and started working on a small version of Naive Bayes in Python. The steps were the usual training and then prediction. I have a few questions and want to know why the accuracy is quite bad.
For training, I calculated the log likelihood by the formula:
log( (count(word, spam) + 1) / (spamSize + vocabSize) )
My question is: why did we add the vocabSize in this case :( and is this the correct way of going about it? Code used is below:
#This is for training. Calculate all probabilities and store them in a vector. Better to store it in a file for easier access
from __future__ import division
import sys,os
'''
1. The spam and non-spam is already 50% . So they by default are 0.5
2. Now we need to calculate probability of each word , in spam and non-spam separately
  2.1 we can make two dictionaries, defaultdicts basically, for spam and non-spam
  2.2 When time comes to calculate probabilities, we just need to substitute values
'''
from collections import *
from math import *
spamDict = defaultdict(int)
nonspamDict = defaultdict(int)
spamFolders = ["spam-train"]
nonspamFolders = ["nonspam-train"]
path = sys.argv[1] #Base path
spamVector = open(sys.argv[2],'w') #WRite all spam values into this
nonspamVector = open(sys.argv[3],'w') #Non-spam values
#Go through all files in spam and iteratively add values
spamSize = 0
nonspamSize = 0
vocabSize = 264821
for f in os.listdir(os.path.join(path,spamFolders[0])):
    data = open(os.path.join(path,spamFolders[0],f),'r')
    for line in data:
        words = line.split(" ")
        spamSize = spamSize + len(words)
        for w in words:
            spamDict[w]+=1
for f in os.listdir(os.path.join(path,nonspamFolders[0])):
    data = open(os.path.join(path,nonspamFolders[0],f),'r')
    for line in data:
        words = line.split(" ")
        nonspamSize = nonspamSize + len(words)
        for w in words:
            nonspamDict[w]+=1
logProbspam = {}
logProbnonSpam = {} #This is to store the log probabilities
for k in spamDict.keys():
    #Need to calculate P(x | y = 1)
    numerator = spamDict[k] + 1 # Frequency
    print 'Word',k,' frequency',spamDict[k]
    denominator = spamSize + vocabSize
    p = log(numerator/denominator)
    logProbspam[k] = p
for k in nonspamDict.keys():
    numerator = nonspamDict[k] + 1 #frequency
    denominator = nonspamSize + vocabSize
    p = log(numerator/denominator)
    logProbnonSpam[k] = p
for k in logProbnonSpam.keys():
    nonspamVector.write(k+" "+str(logProbnonSpam[k])+"\n")
for k in logProbspam.keys():
    spamVector.write(k+" "+str(logProbspam[k])+"\n")
For prediction, I just took a mail, split it into words, added all the probabilities, separately for spam/non-spam, and multiplied them by 0.5. Whichever was higher was the class label. Code is below:
http://pastebin.com/8Y6Gm2my ( Stackoverflow was again playing games for some reason :-/)
EDIT : I have removed the spam = spam + 1 thing. Instead, I just ignore those words
Problem : My accuracy is quite bad . As noted below.
No of files in spam is 130
No. of spam in ../NaiveBayes/spam-test is 53 no. of non-spam 77
No of files in non-spam is 130
No. of spam in ../NaiveBayes/nonspam-test/ is 6 no. of non-spam 124
Please tell me where all I am going wrong. I think an accuracy of less than 50% means there must be some glaring error(s) in the implementation.
There are multiple errors and bad assumptions in your program - in both parts of it. Here are several.
1. You hardcode into your program the fact that you have the same number of spam and non-spam emails. I'd recommend not hardcoding this assumption. This is not absolutely essential, but in a more general case you will need to remove it.
2. You hardcode into your program some number that you treat as the vocabulary size. I'd not recommend doing this, as this number may change on any modification of the training set. Furthermore, it's actually incorrect. I'd recommend calculating it during learning.
3. This may not be a mistake, but you seem to have a vocabulary of all the words in the training set. This may be suboptimal; the page you refer to actually recommends taking into account only the top 2500 words over all the emails. However, that's not essential for obtaining correct results: even without this filtering, my implementation misclassifies only several emails.
4. You incorrectly account for words that have been observed in spam or non-spam only. The log-probability of finding them in the other subset is not the 1 which you add, but log(1/(spamSize+vocabSize)) or log(1/(nonspamSize+vocabSize)) depending on the group. This is actually very important: you need to store this probability with your data for the program to function correctly.
5. You do not ignore words never observed in the training set. Actually these may be treated in different ways, but you should take them into account.
6. Due to incorrect indentation in the prediction function, you predict using not the whole message, but only the first line of the message. Just a programming bug.
Update. You have fixed 6. Also, 1 is not strictly necessary to fix while you're working with this dataset, and 3 is not required either.
Your modification did not correctly fix either 4 or 5. First, if the word has never been observed in some set, the probability of the message belonging to that set should decrease. Ignoring the word is not a good idea; you need to account for it as a highly improbable one.
Second, your current code is asymmetric, as the word being absent in spam cancels the check for non-spam (but not the other way round). If you need to do nothing in an exception handler, use pass, not continue, as the latter immediately goes to the next for w in words: iteration.
Problem number 2 is also still in place: the vocabulary size you use does not match the real one. It must be the number of different words observed in the training set, not the total number of words in all the messages together.
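A rough sketch of how points 2, 4 and 5 might be handled together. The function names and the toy messages are invented for illustration and are not the asker's code; dropping words absent from the whole training set is just one of the possible treatments mentioned in point 5.

```python
import math
from collections import Counter

def train_class(token_lists, vocab_size):
    """Return smoothed log-probabilities for one class plus the fallback value."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    denom = sum(counts.values()) + vocab_size
    log_probs = {w: math.log((c + 1) / denom) for w, c in counts.items()}
    # point 4: log-probability of a word seen only in the *other* class
    unseen_log_prob = math.log(1 / denom)
    return log_probs, unseen_log_prob

def log_score(tokens, log_probs, unseen_log_prob, vocabulary, log_prior):
    score = log_prior
    for tok in tokens:
        if tok not in vocabulary:
            continue  # assumption for point 5: drop words absent from all training data
        score += log_probs.get(tok, unseen_log_prob)
    return score

# Toy training data, purely illustrative
spam_docs = [["buy", "cheap", "pills"], ["cheap", "offer"]]
ham_docs = [["meeting", "at", "noon"], ["lunch", "offer"]]

# point 2: vocabulary size = number of distinct words seen in training
vocabulary = {tok for doc in spam_docs + ham_docs for tok in doc}
spam_model = train_class(spam_docs, len(vocabulary))
ham_model = train_class(ham_docs, len(vocabulary))

# point 1: priors estimated from message counts instead of a hardcoded 0.5
n_docs = len(spam_docs) + len(ham_docs)
message = ["cheap", "pills", "tomorrow"]
spam_score = log_score(message, *spam_model, vocabulary, math.log(len(spam_docs) / n_docs))
ham_score = log_score(message, *ham_model, vocabulary, math.log(len(ham_docs) / n_docs))
prediction = "spam" if spam_score > ham_score else "nonspam"
```

Because log_probs.get falls back to unseen_log_prob, both classes are scored over the same set of words, which avoids the asymmetry described above.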
Here's at least one of the errors you're making: you're storing log-probabilities in your model file (as you should), but then in the prediction code you're pretending that they are straight probabilities:
totalSpamP = spamP * 0.5
should be
totalSpamP = spamP + math.log(0.5)
Also, I don't get what this line is doing:
spamP = spamP + 1
It seems to be making up for a feature not found in the spam part of the training set, but those words should simply be ignored. Right now, it's adding 1 to a log-probability, which multiplies the underlying probability by e (exp(1)) and is not a valid operation on probabilities.
(As an aside, I just tried classification on this training set with my own implementation of Naive Bayes and got 97.6% accuracy, so that's the figure you should be aiming at :)
