EDIT: The correct version of the code, which works, can be found at:
https://github.com/a7x/NaiveBayes-Classifier
I used data from OpenClassroom and started working on a small version of Naive Bayes in Python. The steps were the usual ones: training, then prediction. I have a few questions, and I want to know why the accuracy is quite bad.
For training, I calculated the log likelihood with the formula:
log((count(word | spam) + 1) / (spamSize + vocabSize))
My question is: why do we add vocabSize in this case, and is this the correct way of going about it? The code I used is below:
#This is for training. Calculate all probabilities and store them in a vector. Better to store it in a file for easier access
from __future__ import division
import sys,os
'''
1. The spam and non-spam priors are already 50%, so by default they are 0.5.
2. Now we need to calculate the probability of each word, in spam and non-spam separately.
   2.1 We can make two dictionaries (defaultdicts, basically), for spam and non-spam.
   2.2 When the time comes to calculate probabilities, we just substitute values.
'''
from collections import defaultdict
from math import log
spamDict = defaultdict(int)
nonspamDict = defaultdict(int)
spamFolders = ["spam-train"]
nonspamFolders = ["nonspam-train"]
path = sys.argv[1]  # base path
spamVector = open(sys.argv[2], 'w')  # write all spam values into this
nonspamVector = open(sys.argv[3], 'w')  # non-spam values
#Go through all files in spam and iteratively add values
spamSize = 0
nonspamSize = 0
vocabSize = 264821
for f in os.listdir(os.path.join(path,spamFolders[0])):
    data = open(os.path.join(path,spamFolders[0],f),'r')
    for line in data:
        words = line.split(" ")
        spamSize = spamSize + len(words)
        for w in words:
            spamDict[w] += 1
for f in os.listdir(os.path.join(path,nonspamFolders[0])):
    data = open(os.path.join(path,nonspamFolders[0],f),'r')
    for line in data:
        words = line.split(" ")
        nonspamSize = nonspamSize + len(words)
        for w in words:
            nonspamDict[w] += 1
logProbspam = {}
logProbnonSpam = {}  # this is to store the log probabilities
for k in spamDict.keys():
    # need to calculate P(x | y = 1)
    numerator = spamDict[k] + 1  # frequency
    print 'Word', k, ' frequency', spamDict[k]
    denominator = spamSize + vocabSize
    p = log(numerator/denominator)
    logProbspam[k] = p
for k in nonspamDict.keys():
    numerator = nonspamDict[k] + 1  # frequency
    denominator = nonspamSize + vocabSize
    p = log(numerator/denominator)
    logProbnonSpam[k] = p
for k in logProbnonSpam.keys():
    nonspamVector.write(k+" "+str(logProbnonSpam[k])+"\n")
for k in logProbspam.keys():
    spamVector.write(k+" "+str(logProbspam[k])+"\n")
For prediction, I just took a mail, split it into words, added up the log probabilities separately for spam and non-spam, and multiplied each result by 0.5. Whichever was higher was the class label. Code is below:
http://pastebin.com/8Y6Gm2my (Stack Overflow was again playing games for some reason :-/)
EDIT: I have removed the spam = spam + 1 thing. Instead, I just ignore those words.
Problem: my accuracy is quite bad, as noted below.
No. of files in spam is 130
No. of spam in ../NaiveBayes/spam-test is 53, no. of non-spam is 77
No. of files in non-spam is 130
No. of spam in ../NaiveBayes/nonspam-test/ is 6, no. of non-spam is 124
Please tell me where I am going wrong. I think an accuracy of less than 50% means there must be some glaring error(s) in the implementation.
There are multiple errors and bad assumptions in your program, in both of its parts. Here are several:
1. You hardcode into your program the fact that you have the same number of spam and non-spam emails. I'd recommend not hardcoding this assumption. It is not absolutely essential here, but in the more general case you will need to remove it.
2. You hardcode into your program a number that you treat as the vocabulary size. I'd recommend against this, as the number may change on any modification of the training set; furthermore, it's actually incorrect. I'd recommend calculating it during training.
3. This may not be a mistake, but you seem to use a vocabulary of all the words in the training set. This may be suboptimal; the page you refer to actually recommends taking into account only the top 2500 words over all the emails. However, that's not essential for obtaining correct results: even without this filtering, my implementation gets only several emails unclassified.
4. You incorrectly account for words that have been observed in spam or non-spam only. The log-probability for such a word being found in the other subset is not the 1 you add, but log(1/(spamSize+vocabSize)) or log(1/(nonspamSize+vocabSize)), depending on its group. This is actually very important: you need to store this probability with your data for the program to function correctly.
5. You do not ignore words never observed in the training set. These may actually be treated in different ways, but you should take them into account.
6. Due to incorrect indentation in the prediction function, you predict using not the whole message but only its first line. Just a programming bug.
Update: You have fixed 6. Also, 1 is not strictly necessary to fix while you're working with this dataset, and 3 is not required either.
However, your modification did not correctly fix either 4 or 5. First, if a word has never been observed in some set, the probability of the message under that set should decrease. Ignoring the word is not a good idea; you need to account for it as a highly improbable one.
Second, your current code is asymmetric: the word being absent in spam cancels the check for non-spam (but not the other way around). If you need to do nothing in an exception handler, use pass, not continue, as the latter immediately goes to the next for w in words: iteration.
Problem 2 is also still in place: the vocabulary size you use does not match the real one. It must be the number of distinct words observed in the training set, not the total number of words in all the messages together.
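Putting the vocabulary-size and unseen-word points together, a minimal sketch of a corrected training step might look like this (Python 3; the toy corpus and all names here are illustrative, not the asker's code or dataset):

```python
from math import log
from collections import defaultdict

def train(spam_docs, ham_docs):
    # Count words per class; the vocabulary is the set of distinct words seen anywhere.
    spam_counts, ham_counts = defaultdict(int), defaultdict(int)
    for doc in spam_docs:
        for w in doc.split():
            spam_counts[w] += 1
    for doc in ham_docs:
        for w in doc.split():
            ham_counts[w] += 1
    vocab = set(spam_counts) | set(ham_counts)
    vocab_size = len(vocab)               # computed from the data, not hardcoded
    spam_size = sum(spam_counts.values())
    ham_size = sum(ham_counts.values())

    def log_probs(counts, size):
        # Laplace smoothing; 'default' is the log-probability of a word
        # unseen in this class: log(1 / (size + vocab_size)).
        lp = {w: log((counts.get(w, 0) + 1) / (size + vocab_size)) for w in vocab}
        return lp, log(1 / (size + vocab_size))

    return log_probs(spam_counts, spam_size), log_probs(ham_counts, ham_size)

# Hypothetical toy corpus:
(spam_lp, spam_default), (ham_lp, ham_default) = train(
    ["buy cheap pills now", "cheap pills cheap"],
    ["meeting at noon", "see you at the meeting"])
```

The per-class default must be stored alongside the table, so that prediction can charge log(1/(size+vocabSize)) for a word missing from one class instead of adding 1.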
Here's at least one of the errors you're making: you're storing log-probabilities in your model file (as you should), but then in the prediction code you're pretending that they are straight probabilities:
totalSpamP = spamP * 0.5
should be
totalSpamP = spamP + math.log(0.5)
Also, I don't get what this line is doing:
spamP = spamP + 1
It seems to be making up for a feature not found in the spam part of the training set, but those words should simply be ignored. Right now it adds 1 to a log-probability, i.e. multiplies the underlying probability by e (exp(1)), which by definition is invalid.
(As an aside, I just tried classification on this training set with my own implementation of Naive Bayes and got 97.6% accuracy, so that's the figure you should be aiming at :)
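A minimal sketch of prediction done entirely in log space (the dictionaries here are hypothetical stand-ins for the model files, and the per-class defaults follow the other answer's point about unseen words):

```python
from math import log

def classify(words, spam_lp, ham_lp, spam_default, ham_default, prior_spam=0.5):
    # Log-probabilities add, so the prior enters as log(prior), not as a factor.
    spam_score = log(prior_spam) + sum(spam_lp.get(w, spam_default) for w in words)
    ham_score = log(1 - prior_spam) + sum(ham_lp.get(w, ham_default) for w in words)
    return "spam" if spam_score > ham_score else "nonspam"

# Hypothetical log-probability tables, standing in for the model files:
spam_lp = {"cheap": log(0.02), "pills": log(0.01)}
ham_lp = {"cheap": log(0.001), "pills": log(0.0005)}
spam_default, ham_default = log(0.0001), log(0.0001)
```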
I'm learning about genetic algorithms, and in order to better understand the concepts I tried to build a genetic algorithm from scratch in Python without using any external module (just the standard library and a little bit of numpy).
The goal is to find a target string. So if I give it the string hello and define 26 chars + a space, there are 27^5 possibilities, which is huge; hence the need for a GA to solve this problem.
I defined the following functions:
Generate population: given a size n and a target, we generate n strings of len(target) random chars; we return the population as a list of str.
Compute a fitness score: if the char at position i is equal to the char at position i of target, we increment the score. Here's the code:
def fitness(indiv, target):
    score = 0
    for idx, char in enumerate(target):
        if char == indiv[idx]:
            score += 1
        else:
            score = 0
    return score
Select parents, cross them over, and generate a new population of children.
Here are the functions responsible for that:
import random
from numpy.random import choice

def crossover(p1, p2):
    # single-point crossover between p1 and p2
    point = random.randrange(len(target))
    # the child c equals p1 before the crossover point; after it we swap to p2
    c = [p1[i] for i in range(point)]
    for i in range(point, len(p1)):
        c.append(p2[i])
    c = "".join(c)
    # we mutate c too
    c = mutate(c)
    return c

def mutate(ind):
    point = random.randrange(len(target))
    new_ind = list(ind)
    new_ind[point] = random.choice(letters)
    return "".join(new_ind)

def select_parent(new_pop, fit_scores):
    total = sum(fit_scores)
    probs = [score/total for score in fit_scores]
    parent = choice(new_pop, 1, p=probs)[0]
    return parent
I'm selecting parents by computing the probability of each individual (individual score / total score of the population), then using a weighted random choice function (from numpy) to select a parent.
For the crossover, I generate a child c and a random splitting point: all chars before this random point come from the first parent, and all chars after it come from the second parent.
Besides that, I defined a function called should_stop, which checks whether we found the target, and print_best, which returns the best individual in a population (highest fitness score).
Then I created a find function that uses all the functions defined above:
def find(size, target, pop):
    scores = [fitness(ind, target) for ind in pop]
    new_pop = []
    # crossover of selected individuals
    for ind in pop:
        pa = select_parent(pop, scores)
        pb = select_parent(pop, scores)
        child = crossover(pa, pb)
        new_pop.append(child)
    best = print_best(new_pop, scores)
    print("********** The best individual is: ", best, " ********")
    return (new_pop, best)
n = 200
target = "hello"
popu = generate_pop(n, target)
for i in range(1000):
    print(len(popu))
    data = find(n, target, popu)
    popu = data[0]
    print("iteration number is ", i)
    if data[1] == target:
        break
The problem: it takes many more iterations than it should to generate hello (more than 200 iterations most of the time), while this example takes only a few: https://jbezerra.github.io/The-Shakespeare-and-Monkey-Problem/index.html
Sure, the problem is not coded the same way (I used Python and a procedural style), but the logic is the same. So what am I doing wrong?
You mutate 100% of the time. You select 'suitable' parents which are likely to produce fit offspring, but then you apply a mutation that's more likely than not to throw the child off. The example you linked behaves the same way if you increase its mutation rate to 100%.
The purpose of mutation is to nudge the search in a different direction when you appear to be stuck in a local optimum; applying it all the time turns this from an evolutionary algorithm into something much closer to random search.
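A common fix is to make mutation probabilistic. A sketch, reusing the question's letters string; the 5% rate is an illustrative choice, not a value from the question:

```python
import random

letters = "abcdefghijklmnopqrstuvwxyz "

def mutate(ind, rate=0.05):  # illustrative rate; the question mutates every child
    # Most children pass through unchanged; only a small fraction is nudged.
    if random.random() >= rate:
        return ind
    point = random.randrange(len(ind))
    new_ind = list(ind)
    new_ind[point] = random.choice(letters)
    return "".join(new_ind)
```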
The idea of genetic algorithms is that the best individuals survive and create new generations.
First of all, you should keep the best individuals of every generation alive in the next one (for example, the best 40% of each generation lives on in the next), and you should breed that 40% with each other. Mutate only a limited number of individuals in every generation; the number should be low, below about 5% of individuals. I believe this will reduce the number of generations.
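That elitist selection step can be sketched like this (the 40% figure is the one suggested above; breed is a hypothetical stand-in for the question's crossover-plus-mutation step):

```python
import random

def next_generation(pop, fitness_scores, elite_frac=0.40):
    # Keep the fittest elite_frac unchanged; fill the rest by breeding the elite.
    ranked = [ind for _, ind in sorted(zip(fitness_scores, pop), reverse=True)]
    n_elite = max(2, int(elite_frac * len(pop)))
    elite = ranked[:n_elite]
    children = [breed(random.choice(elite), random.choice(elite))
                for _ in range(len(pop) - n_elite)]
    return elite + children

def breed(p1, p2):
    # hypothetical stand-in: single-point crossover, no mutation
    point = random.randrange(len(p1))
    return p1[:point] + p2[point:]
```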
I would also suggest defining your strings in a dictionary and giving each word a number, then analysing those arrays. For example, if my dictionary is
I: 1
eat: 23
want: 12
to: 2
then "I want to eat" converts to [1, 12, 2, 23].
This reduces the randomness by a factor: the words are drawn from the dictionary, so the only variables are the order and which words appear in your string. Rewrite your algorithm with a dictionary and its run time will improve by a factor.
I've coded up a Naive Bayes classifier, but it doesn't seem to be working particularly well. Counting the words etc. is not a problem, but the probabilities have been.
The method I've been using starts on page 180 of this book,
but I'll use the terms from the wiki article to make it more universal.
Training
With training I'm creating a probability for every word occurring in a category:
for category in categories:
    for word in category_vocabulary[category]:
        word_probability[category][word] = (category_vocabulary[category][word] + 1) / (total_words_in_category[category] + len(vocabulary))
So I take the total number of times a word occurs in a category, add one, and then divide that by the total number of words in the category plus the size of the vocabulary (distinct words). This is P(xi|Ck).
I also calculate the probability of a category, p(Ck), as category_probability, which is simply the number of words in a category divided by the number of words in all categories:
for category in categories:
    category_probability[category] = total_words_in_category[category] / sum(total_words_in_category.values())
Classifying
For classification I loop through all the tokens of the document to be classified, and calculate the product of word_probability for all the words in the text.
for category in categories:
    if word in word_probability[category]:
        if final_probability[category] == 0:
            final_probability[category] = word_probability[category][word]
        else:
            final_probability[category] *= word_probability[category][word]
Lastly to calculate a score I multiply this by the category probability
score = category_probability[category] * final_probability[category]
This score seems to be completely wrong and I'm not sure what to do. When I've looked up other people's methods, they seem to involve a few logs and exponents, but I'm not sure how they fit in with the book and the wiki article.
Any help would be much appreciated, as I imagine what I'm doing wrong is somewhat obvious to somebody who understands this better.
This score seems to be completely wrong and I'm not sure what to do.
First of all, the category probability is not estimated by the fraction of words in a category vs. the total number of words,
for category in categories:
    category_probability[category] = total_words_in_category[category] / sum(total_words_in_category.values())
but by the number of sentences in a category vs. the total number of sentences (or paragraphs, documents, objects - whatever you are classifying). Thus
for category in categories:
    category_probability[category] = total_objects_in_category[category] / sum(total_objects_in_category.values())
When I've looked up other peoples methods they seem to involve a few logs and exponents but I'm not sure how they fit in with the book and wiki article.
This is because direct probability computation (which you do) is numerically unstable. You end up multiplying lots of tiny numbers, so precision falls exponentially. Consequently one uses this simple mathematical identity:
PROD_i P_i(x) = exp[ log[ PROD_i P_i(x) ] ] = exp[ SUM_i log P_i(x) ]
Thus instead of storing probabilities you store logarithms of probabilities, and instead of multiplying them you sum them. If you want to recover the true probability, all you have to do is take the exp; but for classification you do not have to, since P(x) > P(y) <=> log P(x) > log P(y).
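In code, the classification loop then sums logs instead of multiplying raw probabilities. A sketch with hypothetical toy numbers (the category names and tables are made up for illustration):

```python
from math import log

def classify(document, word_logprob, category_logprior):
    # Sum log-probabilities instead of multiplying probabilities.
    scores = {}
    for c in category_logprior:
        s = category_logprior[c]
        for w in document:
            if w in word_logprob[c]:  # as in the question, unknown words are skipped
                s += word_logprob[c][w]
        scores[c] = s
    return max(scores, key=scores.get)

# Hypothetical two-category model:
word_logprob = {"sport": {"goal": log(0.5), "vote": log(0.1)},
                "politics": {"goal": log(0.1), "vote": log(0.5)}}
category_logprior = {"sport": log(0.5), "politics": log(0.5)}
```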
I'm constructing a Naive Bayes text classifier from scratch in Python and I am aware that, upon encountering a product of very small probabilities, using a logarithm over the probabilities is a good choice.
The issue now, is that the mathematical function that I'm using has a summation OVER a product of these extremely small probabilities.
To be specific, I'm trying to calculate the total word probabilities given a mixture component (class) over all classes.
Just plainly adding up the logs of these total probabilities is incorrect, since the log of a sum is not equal to the sum of logs.
To give an example, let's say that I have 3 classes, 2000 words and 50 documents.
Then I have a word probability matrix called wordprob with 2000 rows and 3 columns.
The algorithm for the total word probability in this example would look like this:
sum = 0
for j in range(0, 3):
    prob_product = 1
    for i in words:  # just the indices of words from my vocabulary in this document
        prob_product = prob_product * wordprob[i, j]
    sum = sum + prob_product
What ends up happening is that prob_product becomes 0 in many iterations, because so many small probabilities are multiplied together.
Since I can't easily solve this with logs (because of the summation in front), I'm totally clueless.
Any help will be much appreciated.
I think you may be best to keep everything in logs. The first part of this, to compute the log of the product is just adding up the log of the terms. The second bit, computing the log of the sum of the exponentials of the logs is a bit trickier.
One way would be to store each of the logs of the products in an array, and then you need a function that, given an array L with n elements, will compute
S = log( sum { i=1..n | exp( L[i])})
One way to do this is to find the maximum, M say, of the L's; a little algebra shows
S = M + log( sum { i=1..n | exp( L[i]-M)})
Each of the terms L[i]-M is non-positive, so overflow can't occur; underflow is not a problem, as for those terms exp will simply return 0. At least one of them (the one where L[i] is M) will be zero, so its exp will be one, and we'll end up with something we can pass to log. In other words, the evaluation of the formula will be trouble-free.
If you have the function log1p (log1p(x) = log(1+x)) then you could gain some accuracy by omitting the (just one!) i where L[i] == M from the sum, and passing the sum to log1p instead of log.
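A sketch of that max-shifted computation in Python (a ready-made version also exists as scipy.special.logsumexp, if that's an option):

```python
from math import exp, log

def logsumexp(L):
    # log(sum(exp(l) for l in L)), shifted by the max so exp never overflows.
    M = max(L)
    return M + log(sum(exp(l - M) for l in L))
```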
Your question seems to be on the math side of things rather than the coding of it.
I haven't quite figured out what your issue is, but note that the sum of logs equals the log of the product; don't know if that helps.
Also, you are calculating one prob_product for every j but only using the last one (and re-initializing it each time). You meant to do one of two things: either initialize it before the j-loop or use it before you increment j. Finally, it doesn't look like you need to initialize sum unless this is part of yet another loop you are not showing here.
That's all I have for now.
Sorry for the long post and no code.
High school algebra tells you this:
log(A*B*....*Z) = log(A) + log(B) + ... + log(Z) != log(A + B + .... + Z)
(1) My goal:
I am trying to use an SVM to classify 10000 documents (each with 400 words) into 10 classes (evenly distributed). The features explored in my work include word n-grams (n=1-4) and character n-grams (n=1-6).
(2) My approach: I represent each document using a vector of frequency values for each feature in the document, and use TF-IDF to normalize the vectors. Parts of my code are below:
def commonVec(dicts, count1, count2):
    '''put features with frequency between count1 and count2 into a common vector used for SVM training'''
    global_vector = []
    master = {}
    for i, d in enumerate(dicts):
        for k in d:
            master.setdefault(k, []).append(i)
    for key in master:
        if len(master[key]) >= count1 and len(master[key]) <= count2:
            global_vector.append(key)
    global_vector1 = sorted(global_vector)
    return global_vector1
def featureComb(mix, count1, count2, res1):
    '''combine word n-grams and character n-grams into a vector'''
    if mix[0]:
        common_vector1 = []
        for i in mix[0]:
            dicts1 = []
            for res in res1:  # I stored all documents in a database. res1 is the document result set and res is each document.
                dicts1.append(ngram.characterNgrams(res[1], i))  # characterNgrams() returns a dictionary with feature name as the key, frequency as the value.
            common_vector1.extend(commonVec(dicts1, count1, count2))
    else:
        common_vector1 = []
    if mix[1]:
        common_vector2 = []
        for j in mix[1]:
            dicts2 = []
            for res in res1:
                dicts2.append(ngram.wordNgrams(res[1], j))
            common_vector2.extend(commonVec(dicts2, count1, count2))
    else:
        common_vector2 = []
    return common_vector1 + common_vector2
def svmCombineVector(mix, global_combine, label, X, y, res1):
    '''Construct the X vector that can be used to train the SVM'''
    lstm = []
    for res in res1:
        y.append(label[res[0]])  # insert class label into y
        dici1 = {}
        dici2 = {}
        freq_term_vector = []
        for i in mix[0]:
            dici1.update(ngram.characterNgrams(res[1], i))
        freq_term_vector.extend(dici1[gram] if gram in dici1 else 0 for gram in global_combine)
        for j in mix[1]:
            dici2.update(ngram.wordNgrams(res[1], j))
        freq_term_vector.extend(dici2[gram] if gram in dici2 else 0 for gram in global_combine)
        lstm.append(freq_term_vector)
    freq_term_matrix = np.matrix(lstm)
    transformer = TfidfTransformer(norm="l2")
    tfidf = transformer.fit_transform(freq_term_matrix)
    X.extend(tfidf.toarray())
X = []
y = []
character = [1,2,3,4,5,6]
word = [1,2,3,4]
mix = [character, word]
global_vector_combine = featureComb(mix, 2, 5000, res1)
print len(global_vector_combine)  # 542401
svmCombineVector(mix, global_vector_combine, label, X, y, res1)
clf1 = svm.LinearSVC()
clf1.fit(X, y)
(3) My problem: however, when I ran the source code, a memory error occurred:
Traceback (most recent call last):
  File "svm.py", line 110, in <module>
    functions.svmCombineVector(mix,global_vector_combine,label,X,y,res1)
  File "/home/work/functions.py", line 201, in svmCombineVector
    X.extend(tfidf.toarray())
  File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 901, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 269, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/base.py", line 789, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
I'm really having a hard time with this and need help. Could anyone explain some details and give me some idea of how to solve it, or check my source code and show me some other methods to use memory more effectively?
The main problem you're facing is that you're using far too many features. It's actually quite extraordinary that you've managed to generate 542401 features from documents that contain just 400 words! I've seen SVM classifiers separate spam from non-spam with high accuracy using just 150 features -- word counts of selected words that say a lot about whether the document is spam. These use stemming and other normalization tricks to make the features more effective.
You need to spend some time thinning out your features. Think about which features are most likely to contain information useful for this task. Experiment with different features. As long as you keep throwing everything but the kitchen sink in, you'll get memory errors. Right now you're trying to pass 10000 data points with 542401 dimensions each to your SVM. That's 542401 * 10000 * 4 = 21 gigabytes (conservatively) of data. My computer only has 4 gigabytes of RAM. You've got to pare this way down.[1]
A first step towards doing so would be to think about how big your total vocabulary size is. Each document has only 400 words, but let's say those 400 words are taken from a vocabulary of 5000 words. That means there will be 5000 ** 4 = 6.25 * 10 ** 14 possible 4-grams. That's half a quadrillion possible 4-grams. Of course not all those 4-grams will appear in your documents, but this goes a long way towards explaining why you're running out of memory. Do you really need these 4-grams? Could you get away with 2-grams only? There are a measly 5000 ** 2 = 25 million possible 2-grams. That will fit much more easily in memory, even if all possible 2-grams appear (unlikely).
Also keep in mind that even if the SVM could handle quadrillions of datapoints, it would probably give bad results, because when you give any learning algorithm too many features, it will tend to overfit, picking up on irrelevant patterns and overgeneralizing from them. There are ways of dealing with this, but it's best not to deal with it at all if you can help it.
I will also mention that these are not "newbie" problems. These are problems that machine learning specialists with PhDs have to deal with. They come up with lots of clever solutions, but we're not so clever that way, so we have to be clever a different way.
Although I can't offer you specific suggestions for cleverness without knowing more, I would say that, first, stemming is a good idea in at least some cases. Stemming simply removes grammatical inflection, so that different forms of the same word ("swim" and "swimming") are treated as identical. This will probably reduce your vocabulary size significantly, at least if you're dealing with English text. A common choice is the porter stemmer, which is included in nltk, as well as in a number of other packages. Also, if you aren't already, you should probably strip punctuation and reduce all words to lower-case. From there, it really depends. Stylometry (identifying authors) sometimes requires only particles ("a", "an", "the"), conjunctions ("and", "but") and other very common words; spam, on the other hand, has its own oddball vocabularies of interest. At this level, it is very difficult to say in advance what will work; you'll almost certainly need to try different approaches to see which is most effective. As always, testing is crucial!
[1] Well, possibly you have a huge amount of RAM at your disposal. For example, I have access to a machine with 48G of RAM at my current workplace. But I doubt it could handle this either, because the SVM will have its own internal representation of the data, which means there will be at least one copy at some point; if a second copy is needed at any point -- kaboom.
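As a stopgap before thinning the features: the traceback shows the failure happens in tfidf.toarray(), and scikit-learn's LinearSVC can consume scipy sparse matrices directly, so the result of TfidfTransformer (which is already sparse) never needs to be densified. A sketch with tiny stand-in rows (note the dense freq_term_matrix built from lstm in the question would have to be assembled sparsely as well for this to help at full scale):

```python
import numpy as np
from scipy import sparse

# Tiny stand-ins for the question's per-document frequency rows.
rows = [sparse.csr_matrix(np.array([[0, 1, 0, 2]], dtype=float)),
        sparse.csr_matrix(np.array([[3, 0, 0, 0]], dtype=float))]
X = sparse.vstack(rows, format="csr")  # stays sparse: no dense n_docs x n_features array

# TfidfTransformer.fit_transform returns such a sparse matrix, and
# svm.LinearSVC().fit(X, y) accepts it directly, so the
# X.extend(tfidf.toarray()) step can be dropped entirely.
```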
I am solving homework 1 of the Caltech Machine Learning course (http://work.caltech.edu/homework/hw1.pdf). To solve questions 7-10 we need to implement a PLA. This is my implementation in Python:
import sys, math, random

w = []       # stores the weights
data = []    # stores the vector X (x1, x2, ...)
output = []  # stores the output (y)

# returns 1 if dot product is more than 0
def sign_dot_product(x):
    global w
    dot = sum([w[i]*x[i] for i in xrange(len(w))])
    if dot > 0:
        return 1
    else:
        return -1

# checks if a point is misclassified
def is_misclassified(rand_p):
    return (True if sign_dot_product(data[rand_p]) != output[rand_p] else False)

# loads data in the following format:
# x1 x2 ... y
# In the present case for d=2:
# x1 x2 y
def load_data():
    f = open("data.dat", "r")
    global w
    for line in f:
        data_tmp = [1] + [float(x) for x in line.split(" ")]
        data.append(data_tmp[0:-1])
        output.append(data_tmp[-1])
def train():
    global w
    w = [random.uniform(-1,1) for i in xrange(len(data[0]))]  # initializes w with random weights
    iter = 1
    while True:
        rand_p = random.randint(0, len(output)-1)  # randomly picks a point
        check = [0]*len(output)  # check is a list; the ith location is 1 if the ith point is correctly classified
        while not is_misclassified(rand_p):
            check[rand_p] = 1
            rand_p = random.randint(0, len(output)-1)
            if sum(check) == len(output):
                print "All points successfully satisfied in ", iter-1, " iterations"
                print iter-1, w, data[rand_p]
                return iter-1
        sign = output[rand_p]
        w = [w[i] + sign*data[rand_p][i] for i in xrange(len(w))]  # changing weights
        if iter > 1000000:
            print "greater than 1000"
            print w
            return 10000000
        iter += 1

load_data()

def simulate():
    tot_iter = sum([train() for x in xrange(100)])
    print float(tot_iter)/100

simulate()
The problem: according to the answer to question 7, it should take around 15 iterations for the perceptron to converge for this training-set size, but my implementation takes an average of 50000 iterations. The training data is supposed to be randomly generated, but I am generating data for simple lines such as x=4, y=2, etc. Is this the reason I am getting the wrong answer, or is there something else wrong? A sample of my training data (separable using y=2):
1 2.1 1
231 100 1
-232 1.9 -1
23 232 1
12 -23 -1
10000 1.9 -1
-1000 2.4 1
100 -100 -1
45 73 1
-34 1.5 -1
It is in the format x1 x2 output(y)
It is clear that you are doing a great job learning both Python and classification algorithms with your effort.
However, because of some of the stylistic inefficiencies with your code, it makes it difficult to help you and it creates a chance that part of the problem could be a miscommunication between you and the professor.
For example, does the professor wish for you to use the perceptron in "online mode" or "offline mode"? In online mode you should move sequentially through the data points, and you should not revisit any of them. From the assignment's conjecture that it should require only 15 iterations to converge, I am curious whether this implies that the first 15 data points, in sequential order, would result in a classifier that linearly separates your data set.
By instead sampling randomly with replacement, you might be causing yourself to take much longer (although, depending on the distribution and size of the data sample, this is admittedly unlikely, since you'd expect roughly that any 15 points would do about as well as the first 15).
The other issue is that after you detect a correctly classified point (cases when not is_misclassified evaluates to True), if you then witness a new random point that is misclassified, your code will kick down into the larger section of the outer while loop and then go back to the top, where it will overwrite the check vector with all 0s.
This means that the only way your code will detect that it has correctly classified all the points is if the particular random sequence in which it evaluates them (in the inner while loop) happens to be a string of all 1s, barring the miraculous ability that on any particular 0, on that pass through the array, it classifies correctly.
I can't quite formalize why I think that will make the program take much longer, but it seems like your code is requiring a much stricter form of convergence, where it sort of has to learn everything all at once on one monolithic pass way late in the training stage after having been updated a bunch already.
One easy way to check whether my intuition about this is crappy would be to move the line check=[0]*len(output) outside of the while loops altogether and only initialize it one time.
Some general advice to make the code easier to manage:
Don't use global variables. Instead, let your functions that load and prep the data return things.
There are a few places where you say, for example,
return (True if sign_dot_product(data[rand_p])!=output[rand_p] else False)
This kind of thing can be simplified to
return sign_dot_product(data[rand_p]) != output[rand_p]
which is easier to read and conveys what criteria you're trying to check for in a more direct manner.
I doubt efficiency plays an important role here, since this seems to be a pedagogical exercise, but there are a number of ways to refactor your use of list comprehensions that might be beneficial. And if possible, just use NumPy, which has native array types. Witnessing how some of these operations have to be expressed with list operations is lamentable. Even if your professor doesn't want you to implement with NumPy because she or he is trying to teach you pure fundamentals, I say just ignore that and go learn NumPy. It will help you with jobs, internships, and practical skill with these kinds of manipulations in Python vastly more than fighting with the native data types to do something they were not designed for (array computing).
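Following that advice, a compact NumPy sketch of the PLA (a sketch only: the synthetic separable sample below is illustrative, not the assignment's randomly generated data):

```python
import numpy as np

def pla(X, y, max_updates=10000, seed=0):
    # X: (n, d) with a leading column of 1s; y: (n,) of +/-1. Returns (w, updates).
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for updates in range(max_updates):
        wrong = np.nonzero(np.sign(X @ w) != y)[0]  # sign(0) = 0, which counts as wrong
        if wrong.size == 0:
            return w, updates
        i = rng.choice(wrong)   # pick one misclassified point at random
        w = w + y[i] * X[i]     # the perceptron update
    return w, max_updates

# Toy sample separable by x2 > 2, in the question's x1 x2 y format.
X = np.array([[1, 0.5, 3], [1, -1, 4], [1, 2, 5],
              [1, 1, 0], [1, -2, 1], [1, 3, 1.5]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
w, n_updates = pla(X, y)
```

Tracking misclassified points with a single vectorized comparison also sidesteps the check-vector bookkeeping discussed above.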