Sentiment Classification with NLTK Naive Bayes classifier - python

I am implementing a Naive Bayes classifier with NLTK, but when I train the classifier with the extracted features it raises the error "too many values to unpack". I am a beginner in Python. Here is the code; the program reads text from files and extracts features from those files.
import nltk.classify.util,os,sys;
from nltk.classify import NaiveBayesClassifier;
from nltk.corpus import stopwords;
from nltk.tokenize import word_tokenize,RegexpTokenizer;
import re;

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

def word_feats(words):
    return dict([(word,True) for word in words])

def feature_extractor(sentiment):
    path = "train/"+sentiment+"/"
    files = os.listdir(path);
    feats = {};
    i = 0;
    for file in files:
        f = open(path+file,"r", encoding='utf-8');
        review = f.read();
        review = remove_tags(review);
        stopWords = (stopwords.words("english"))
        tokenizer = RegexpTokenizer(r"\w+");
        tokens = tokenizer.tokenize(review);
        features = word_feats(tokens);
        feats.update(features)
    return feats;

posative_feat = feature_extractor("pos");
p = open("posFeat.txt","w", encoding='utf-8');
p.write(str(posative_feat));

negative_feat = feature_extractor("neg");
n = open("negFeat.txt","w", encoding='utf-8');
n.write(str(negative_feat));

plength = int(len(posative_feat)*3/4);
nlength = int(len(negative_feat)*3/4)
totalLength = plength+nlength;

trainFeatList = {}
testFeatList = {}

i = 0
for items in posative_feat.items():
    i += 1;
    value = {items[0]:items[1]}
    if(i < plength):
        trainFeatList.update(value);
    else:
        testFeatList.update(value);

j = 0
for items in negative_feat.items():
    j += 1;
    value = {items[0]:items[1]}
    if(j < plength):
        trainFeatList.update(value);
    else:
        testFeatList.update(value);

classifier = NaiveBayesClassifier.train(trainFeatList)
print(nltk.classify.util.accuracy(classifier,testFeatList));
classifier.show_most_informative_features();

Looking at the NLTK book page http://www.nltk.org/book/ch06.html, it seems the data that is given to the NaiveBayesClassifier is of the type list(tuple(dict,str)), whereas the data you are passing to the classifier is of the type list(dict).
If you represent the data in that manner, you will get the expected results. Basically, it is a list of (feature dict, label) pairs.
There are multiple errors in your code:
1. Python does not use a semicolon as a line ending.
2. The True boolean does not seem to serve a purpose on line 12.
3. trainFeatList and testFeatList should be lists.
4. Each value in your feature items list should be a tuple(dict,str).
5. Assign labels to the features in the list (from (4)).
6. Take the NaiveBayesClassifier, and any use of classifier, out of the negative-features loop.
If you fix the previous errors, the classifier will work, but unless I know what you are trying to achieve it is hard to say more; as written it is confusing and does not predict well.
The main line you need to pay attention to is where you assign something to your variable value.
for example:
value = {items[0]:items[1]}
should be something like:
value = ({feature_name:feature}, label)
Then afterwards you would call .append() on your lists to add each value instead of .update().
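For illustration, here is a minimal sketch of that representation (pos_token_lists and neg_token_lists are hypothetical stand-ins for the tokenized reviews built inside feature_extractor, one list of tokens per review):

def word_feats(words):
    # feature dict: maps each word to True (word-presence features)
    return dict([(word, True) for word in words])

trainFeatList = []   # lists of (feature dict, label) tuples, not dicts
testFeatList = []

for tokens in pos_token_lists:
    value = (word_feats(tokens), 'pos')   # tuple(dict, str)
    trainFeatList.append(value)
for tokens in neg_token_lists:
    value = (word_feats(tokens), 'neg')
    trainFeatList.append(value)

classifier = NaiveBayesClassifier.train(trainFeatList)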
You can look at an example of your updated code in a buggy working state at http://pastebin.com/91Zu59Cm but I would suggest thinking about the following:
How is the data supposed to be represented for the NaiveBayesClassifier class?
What features are you trying to capture?
What labels are associated with those features?

Related

Text vocabulary coverage using sklearn

I have an sklearn CountVectorizer object trained on some corpus. When vectorizing a new document, the vector contains only the tokens that appear in the vectorizer's vocabulary.
I'd like to add another feature to the vector which is the vocabulary coverage, or in other words, the percentage of tokens that are in the vocabulary.
Here's my code:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "good morning sunshine",
    "hello world",
    "hello sunshine",
]

vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus)

def get_vocab_coverage(vectorizer, sent):
    preprocessor = vectorizer.build_preprocessor()
    tokenizer = vectorizer.build_tokenizer()
    processed = preprocessor(sent)
    tokenized_license = tokenizer(processed)
    count = sum(w in vectorizer.vocabulary_ for w in tokenized_license)
    return count / len(tokenized_license)

get_vocab_coverage(vectorizer, "hello world")   # => 1.0
get_vocab_coverage(vectorizer, "hello to you")  # => 0.333
The problem with this code is that it's not very Pythonic: it uses an internal sklearn variable and doesn't scale well. Any idea how I can improve it? Or is there an existing method that does the same?
The transform method of CountVectorizer may be useful:
def get_vocab_coverage(vectorizer, sent: str):
    preprocessor = vectorizer.build_preprocessor()
    tokenizer = vectorizer.build_tokenizer()
    processed = preprocessor(sent)
    tokenized_license = tokenizer(processed)
    if len(tokenized_license):
        return vectorizer.transform([sent]).sum() / len(tokenized_license)
    return 0
Note the wrapping of the sample string in a list before passing it to transform (because it expects a batch of documents, not a single string), and the handling of the len == 0 case (no need to catch a division-by-zero error). A few things to point out:
- the transform method returns a sparse vector, so sum works fast
- CountVectorizer gains no advantage from passing documents to transform in batches; internally there is an ordinary for-loop, so a get_vocab_coverage function that processes one example at a time is enough
- transform expects raw documents, so the function tokenizes each example twice (explicitly while creating tokenized_license and implicitly inside transform), and there seems to be no simple way to avoid it (other than returning to the previous version); if that's critical, consider rewriting the _count_vocab method to take a preprocessed_documents argument instead of raw_documents
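As a quick sanity check (assuming the vectorizer has already been fitted on the three-document corpus from the question), the transform-based version reproduces the original results and also covers empty input:

print(get_vocab_coverage(vectorizer, "hello world"))   # => 1.0
print(get_vocab_coverage(vectorizer, "hello to you"))  # => 0.333...
print(get_vocab_coverage(vectorizer, ""))              # => 0, no ZeroDivisionError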

Solving memory issues when using Gensim LDA Multicore

For my project I am trying to use unsupervised learning to identify different topics from application descriptions, but I am running into a strange problem. Firstly, I have 3 different datasets: one with 15k documents, another with 50k documents, and the last with 2m documents. I am testing models with different numbers of topics (k), ranging from 5 to 100 with a step size of 5, in order to check which k results in the best model, assessed initially by the highest coherence score. For each k, I also build 3 different models with chunksizes 10, 100 and 1000.
So now, moving on to the problem I am having: obviously my own machine is too slow and does not have enough cores for this kind of computation, hence I am using my university's server. The problem is that my program seems to consume too much memory, and I am unsure of the reason. I already made some adjustments so that the corpus is not loaded entirely into memory (or at least I think I did). The dataset with 50k entries, already at iteration k=50 (so halfway), seems to have consumed the allotted 100GB of memory, which seems enormous.
I would appreciate any help in the right direction, and thanks for taking the time to look at this. Below is the code from my topic_modelling.py file. Comments in the file are a bit outdated, sorry about that.
class MyCorpus:
    texts: list
    dictionary: dict

    def __init__(self, descriptions, dictionary):
        self.texts = descriptions
        self.dictionary = dictionary

    def __iter__(self):
        for line in self.texts:
            try:
                # assume there's one document per line, tokens separated by whitespace
                yield self.dictionary.doc2bow(line)
            except StopIteration:
                pass
# Function that, given a dataframe, creates a dictionary and corpus.
# These are used to create an LDA model. Here we automatically use the Description
# column from each dataframe.
def create_dict_and_corpus(df):
    text_descriptions = remove_characters_and_create_list(df, 'Description')
    # print(text_descriptions)
    dictionary = gensim.corpora.Dictionary(text_descriptions)
    corpus = MyCorpus(text_descriptions, dictionary)
    return text_descriptions, dictionary, corpus

# Given a dataframe and a column name in the dataframe, extract all words and return a list.
# Also removes all characters that are not alphanumeric or spaces.
def remove_characters_and_create_list(df, column_name, split=True):
    df[column_name] = df[column_name].astype(str)
    texts = []
    for x in range(df[column_name].size):
        current_string = df[column_name][x]
        filtered_string = re.sub(r'[^A-Za-z0-9 ]+', '', current_string)
        if split:
            texts.append(filtered_string.split())
        else:
            texts.append(filtered_string)
    return texts

# This function, given the parameters, creates an LDA model for each number between
# the start limit and the end limit. After this the coherence and perplexity are
# calculated for each of those models and saved in a csv file to analyze later.
def test_lda_models(text, corpus, dictionary, start_limit, end_limit, path):
    results = []
    print("============Starting topic modelling============")
    for k in range(start_limit, end_limit+1, 5):
        for p in range(1, 4):
            chunk = pow(10, p)
            t0 = time.time()
            lda_model = gensim.models.ldamulticore.LdaMulticore(corpus,
                                                                num_topics=k,
                                                                id2word=dictionary,
                                                                passes=p,
                                                                chunksize=chunk)
            # To calculate the goodness of the model
            perplexity = lda_model.bound(corpus)
            coherence_model_lda = CoherenceModel(model=lda_model, texts=text, dictionary=dictionary, coherence='c_v')
            coherence_lda = coherence_model_lda.get_coherence()
            t1 = time.time()
            print(f"=====Done K={k} model with passes={p} and chunksize={chunk}, took {t1-t0} seconds=====")
            results.append((k, chunk, coherence_lda, perplexity))
    # Storing the results in a csv file, except the actual lda model (that would not make sense)
    path = make_dir_if_not_exists(path)
    list_tuples_to_csv(results, ['#OfTopics', 'ChunkSize', 'CoherenceScore', 'Perplexity'], f"{path}/K={start_limit}to{end_limit}.csv")
    return results

# Function to plot the visualization of an LDA model. This visualization is then
# saved as an html file inside the given path.
def single_lda_model_visualization(k, c, corpus, dictionary, lda_model, path):
    vis = gensimvis.prepare(lda_model, corpus, dictionary)
    pyLDAvis.save_html(vis, f"{path}/visualization.html")

# Given the results produced by test_lda_models, loop through the models and save the
# topic words of each model and the visualization of the topics in the given path.
def save_lda_result(k, c, lda_model, corpus, dictionary, path):
    list_tuples_to_csv(lda_model.print_topics(num_topics=k), ['Topic#', 'Associated Words'], f"{path}/associated_words.csv")
    single_lda_model_visualization(k, c, corpus, dictionary, lda_model, path)

# This is the entire pipeline that needs to be performed for a single dataset,
# which includes computing the LDA models from start to end limit and calculating
# and saving the topic words and visual graphs for the top n topics with the highest
# coherence score.
def perform_topic_modelling_single_df(df, start_limit, end_limit, path):
    # Extracting the necessary data required for LDA model computation
    text_descriptions, dictionary, corpus = create_dict_and_corpus(df)
    results_lda = test_lda_models(text_descriptions, corpus, dictionary, start_limit, end_limit, path)
    # Sorting the results based on the third tuple value (index 2), which is 'coherence'
    results_lda.sort(key=lambda x: x[2], reverse=True)
    # Getting the top 5 results to pass to the save_lda_result function
    results = results_lda[:5]
    corpus_for_saving = [dictionary.doc2bow(text) for text in text_descriptions]
    texts = remove_characters_and_create_list(df, 'Description', split=False)
    # Perform application-to-topic modelling for the best lda models based on the
    # coherence score (TODO maybe test with other lda models?)
    print("getting descriptions for csv")
    for k, c, _, _ in results:
        dir_path = make_dir_if_not_exists(f"{path}/k={k}_chunk={c}")
        p = int(math.log10(c))
        lda_model = gensim.models.ldamulticore.LdaMulticore(corpus,
                                                            num_topics=k,
                                                            id2word=dictionary,
                                                            passes=p,
                                                            chunksize=c)
        print(f"=====REDOING K={k} model with passes={p} and chunksize={c}=====")
        save_lda_result(k, c, lda_model, corpus_for_saving, dictionary, dir_path)
        application_to_topic_modelling(df, k, c, lda_model, corpus_for_saving, texts, dir_path)

# Performs the whole topic modelling pipeline, taking different genre datasets
# and the entire dataset as a whole.
def perform_topic_modelling_pipeline(path_ex):
    # entire_df = pd.read_csv("../data/preprocessed_data/preprocessed_10000_trial.csv")
    entire_df = pd.read_csv(os.path.join(ROOT_DIR, f"data/preprocessed_data/preprocessedData_{path_ex}.csv"))
    print("size of df")
    print(entire_df.shape)
    # For the entire df, go from the start limit to nGenres to find the best LDA model
    nGenres = row_counter(os.path.join(ROOT_DIR, f"data/genre_wise_data/data{path_ex}/genre_frequency.csv"))
    nGenres_rounded = math.ceil(nGenres / 5) * 5
    print(f"Original number of genres should be {nGenres}, but we are rounding to {nGenres_rounded}")
    path = make_dir_if_not_exists(os.path.join(ROOT_DIR, f"results/data{path_ex}/aall_data"))
    perform_topic_modelling_single_df(entire_df, 5, 100, path)
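Note that MyCorpus above still holds every tokenized description in memory via self.texts, so iteration is lazy but storage is not. For comparison, a minimal sketch of the standard gensim streaming pattern, which re-reads documents from disk on every pass (the file name descriptions.txt is hypothetical):

class StreamingCorpus:
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        # re-read the file on every pass; only one document is in memory at a time
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield self.dictionary.doc2bow(line.split())

# the dictionary can be built from the same stream, again without a list in memory:
# dictionary = gensim.corpora.Dictionary(line.split() for line in open('descriptions.txt', encoding='utf-8'))
# corpus = StreamingCorpus('descriptions.txt', dictionary)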

NaiveBayes model training with separate training set and data using pyspark

So, I am trying to train a naive Bayes classifier. I went to a lot of trouble preprocessing the data and have now produced two RDDs:
- Training set: composed of a set of sparse vectors;
- Labels: a corresponding list of labels (0, 1) for every vector.
I need to run something like this:
# Train a naive Bayes model.
model = NaiveBayes.train(training, 1.0)
but "training" is a dataset derived from running:
def parseLine(line):
    parts = line.split(',')
    label = float(parts[0])
    features = Vectors.dense([float(x) for x in parts[1].split(' ')])
    return LabeledPoint(label, features)

data = sc.textFile('data/mllib/sample_naive_bayes_data.txt').map(parseLine)
based on the documentation for Python here. My question is: given that I don't want to load the data from a txt file, and that I have already created the training set in the form of records mapped to sparse vectors (an RDD) and a corresponding list of labels, how can I run naive Bayes?
Here is part of my code:
# Function
def featurize(tokens_kv, dictionary):
    """
    :param tokens_kv: list of tuples of the form (word, tf-idf score)
    :param dictionary: list of n words
    :return: sparse_vector of size n
    """
    # MUST sort tokens_kv by key
    tokens_kv = collections.OrderedDict(sorted(tokens_kv.items()))
    vector_size = len(dictionary)
    non_zero_indexes = []
    index_tfidf_values = []
    for key, value in tokens_kv.iteritems():
        index = 0
        for word in dictionary:
            if key == word:
                non_zero_indexes.append(index)
                index_tfidf_values.append(value)
            index += 1
    print non_zero_indexes
    print index_tfidf_values
    return SparseVector(vector_size, non_zero_indexes, index_tfidf_values)

# Feature Extraction
Training_Set_Vectors = (TFsIDFs_Vector_Weights_RDDs
                        .map(lambda (tokens): featurize(tokens, Dictionary_BV.value))
                        .cache())
... and labels is just a list of 1s and 0s. I understand that I may need to use LabeledPoint somehow, but I am confused as to how... RDDs are not a list, while labels is a list. I am hoping for something as simple as a way to create LabeledPoint objects[i] by combining sparse-vectors[i] with corresponding-labels[i]... any ideas?
I was able to solve this by first collecting the SparseVectors RDD, effectively converting it to a list. Then I ran a function that constructed a list of LabeledPoint objects:
def final_form_4_training(SVs, labels):
    """
    :param SVs: List of Sparse vectors.
    :param labels: List of labels
    :return: list of LabeledPoint objects
    """
    to_train = []
    for i in range(len(labels)):
        to_train.append(LabeledPoint(labels[i], SVs[i]))
    return to_train

# Feature Extraction
Training_Set_Vectors = (TFsIDFs_Vector_Weights_RDDs
                        .map(lambda (tokens): featurize(tokens, Dictionary_BV.value))
                        .collect())

raw_input("Generate the LabeledPoint parameter... ")
labelled_training_set = sc.parallelize(final_form_4_training(Training_Set_Vectors, training_labels))

raw_input("Train the model... ")
model = NaiveBayes.train(labelled_training_set, 1.0)
However, this assumes that the RDDs maintain their order (which I am not messing with) throughout the processing pipeline. I also hate the part where I had to collect everything on the master. Any better ideas?
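For what it's worth, a sketch of one commonly suggested alternative (my suggestion, not part of the solution above) that avoids collecting to the driver: make the labels an RDD as well and pair the two element-wise with zip(). This still relies on element order, and zip() additionally requires both RDDs to have the same number of partitions and the same number of elements per partition, which sc.parallelize does not guarantee in general:

labels_RDD = sc.parallelize(training_labels)
vectors_RDD = (TFsIDFs_Vector_Weights_RDDs
               .map(lambda (tokens): featurize(tokens, Dictionary_BV.value)))

# zip() pairs the i-th element of each RDD; it requires identical
# partitioning (same partition count and per-partition element counts)
labelled_training_set = (labels_RDD
                         .zip(vectors_RDD)
                         .map(lambda (label, sv): LabeledPoint(label, sv)))

model = NaiveBayes.train(labelled_training_set, 1.0)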

How to correct my Naive Bayes method returning extremely small conditional probabilities?

I'm trying to calculate the probability that an email is spam with Naive Bayes. I have a document class to create the documents (fed in from a website), and another class to train and classify documents. My train function counts all the unique terms in all the documents, the documents in the spam class, and the documents in the non-spam class, and computes prior probabilities (one for spam, another for ham). Then I use the following formula to store the conditional probability of each term in a dict:
condprob[t][c] = (Tct + 1) / (Tct' + B')

where:
Tct = the number of occurrences of a term in a given class
Tct' = the total number of terms in a given class
B' = the number of unique terms in all documents
classes = either spam or ham (spam = spam, ham = not spam)
The issue is that when I use this formula in my code it gives me extremely small conditional probability scores, such as 2.461114392596968e-05. I'm quite sure this is because the values for Tct are very small (like 5 or 8) compared to the denominator values of Tct' (which is 64878 for ham and 308930 for spam) and B' (which is 16386). I can't figure out how to get the condprob scores to something like .00034155, as I can only assume my condprob scores aren't supposed to be as exponentially small as they are. Am I doing something wrong with my calculations? Are the values actually supposed to be this small?
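For example, plugging in the numbers above for a spam term with Tct = 5: (5 + 1) / (308930 + 16386) = 6 / 325316 ≈ 1.8e-05, which matches the magnitudes I'm seeing.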
If it helps, my goal is to score a test set of documents and get results like 327.82, 758.80, or 138.66, using this formula:

score(c) = log(prior[c]) + sum over terms t in the document of log(condprob[t][c])

however, using my small condprob values I only get negative numbers.
Code
Create Document:
class Document(object):
    """
    The instance variables are:
    filename....The path of the file for this document.
    label.......The true class label ('spam' or 'ham'), determined by whether the filename contains the string 'spmsg'
    tokens......A list of token strings.
    """

    def __init__(self, filename=None, label=None, tokens=None):
        """ Initialize a document either from a file, in which case the label
        comes from the file name, or from specified label and tokens, but not
        both.
        """
        if label:  # specify from label/tokens, for testing.
            self.label = label
            self.tokens = tokens
        else:  # specify from file.
            self.filename = filename
            self.label = 'spam' if 'spmsg' in filename else 'ham'
            self.tokenize()

    def tokenize(self):
        self.tokens = ' '.join(open(self.filename).readlines()).split()
NaiveBayes:
class NaiveBayes(object):
    def train(self, documents):
        """
        Given a list of labeled Document objects, compute the class priors and
        word conditional probabilities, following Figure 13.2 of your
        book. Store these as instance variables, to be used by the classify
        method subsequently.
        Params:
          documents...A list of training Documents.
        Returns:
          Nothing.
        """
        ###TODO
        unique = []
        proxy = []
        proxy2 = []
        proxy3 = []
        condprob = [{}, {}]
        Tct = defaultdict()
        Tc_t = defaultdict()
        prior = {}
        count = 0
        oldterms = []
        old_terms = []
        for a in range(len(documents)):
            done = False
            for item in documents[a].tokens:
                if item not in unique:
                    unique.append(item)
                if documents[a].label == "ham":
                    proxy2.append(item)
                    if done == False:
                        count += 1
                elif documents[a].label == "spam":
                    proxy3.append(item)
                done = True
        V = unique
        N = len(documents)
        print("N:", N)
        LB = len(unique)
        print("THIS IS LB:", LB)
        self.V = V
        print("THIS IS COUNT/NC", count)
        Nc = count
        prior["ham"] = Nc / N
        self.prior = prior
        Nc = len(documents) - count
        print("THIS IS SPAM COUNT/NC", Nc)
        prior["spam"] = Nc / N
        self.prior = prior
        text2 = proxy2
        text3 = proxy3
        TctTotal = len(text2)
        Tc_tTotal = len(text3)
        print("THIS IS TCTOTAL", TctTotal)
        print("THIS IS TC_TTOTAL", Tc_tTotal)
        for term in text2:
            if term not in oldterms:
                Tct[term] = text2.count(term)
                oldterms.append(term)
        for term in text3:
            if term not in old_terms:
                Tc_t[term] = text3.count(term)
                old_terms.append(term)
        for term in V:
            if term in text2:
                condprob[0].update({term: (Tct[term] + 1) / (TctTotal + LB)})
            if term in text3:
                condprob[1].update({term: (Tc_t[term] + 1) / (Tc_tTotal + LB)})
        print("This is condprob", condprob)
        self.condprob = condprob

    def classify(self, documents):
        """ Return a list of strings, either 'spam' or 'ham', for each document.
        Params:
          documents....A list of Document objects to be classified.
        Returns:
          A list of label strings corresponding to the predictions for each document.
        """
        ###TODO
        # return list["string1", "string2", "stringn"]
        # docs2 = ham, condprob[0] is ham
        # docs3 = spam, condprob[1] is spam
        unique = []
        ans = []
        hscore = 0
        sscore = 0
        for a in range(len(documents)):
            for item in documents[a].tokens:
                if item not in unique:
                    unique.append(item)
            W = unique
            hscore = math.log(float(self.prior['ham']))
            sscore = math.log(float(self.prior['spam']))
            for t in W:
                try:
                    hscore += math.log(self.condprob[0][t])
                except KeyError:
                    continue
                try:
                    sscore += math.log(self.condprob[1][t])
                except KeyError:
                    continue
            print("THIS IS SSCORE", sscore)
            print("THIS IS HSCORE", hscore)
            unique = []
            if hscore > sscore:
                str = "Spam"
            elif sscore > hscore:
                str = "Ham"
            ans.append(str)
        return ans
Test:
if not os.path.exists('train'):  # download data
    from urllib.request import urlretrieve
    import tarfile
    urlretrieve('http://cs.iit.edu/~culotta/cs429/lingspam.tgz', 'lingspam.tgz')
    tar = tarfile.open('lingspam.tgz')
    tar.extractall()
    tar.close()

train_docs = [Document(filename=f) for f in glob.glob("train/*.txt")]
test_docs = [Document(filename=f) for f in glob.glob("test/*.txt")]
test = train_docs

nb = NaiveBayes()
nb.train(train_docs[1500:])
# uncomment when testing classify()
# predictions = nb.classify(test_docs[:200])
# print("PREDICTIONS", predictions)
The eventual goal is to be able to classify documents as spam or ham, but I want to work on the conditional probability issue first.
The Issue
Are the conditional probability values supposed to be this small? If so, why am I getting strange scores via classify? If not, how do I fix my code to give me the proper condprob values?
Values
The current condprob values that I am getting are along the lines of this:
'tradition': 2.461114392596968e-05, 'fillmore': 2.461114392596968e-05, '796': 2.461114392596968e-05, 'zann': 2.461114392596968e-05
condprob is a list containing two dictionaries, the first for ham and the second for spam. Each dictionary maps a term to its conditional probability. I want to have "normal" small values such as .00031235, not 3.1235e-05.
The reason for this is that when I run the condprob values through the classify method with some test documents I get scores like
THIS IS HSCORE -2634.5292392650663, THIS IS SSCORE -1707.983339196181
when they should look like
THIS IS HSCORE 327.82, THIS IS SSCORE 758.80
Running Time
~1 min, 30 sec
(You seem to be working with log probabilities, which is very sensible, but I am going to write most of the following for the raw probabilities, which you could get by taking the exponential of the log probabilities, because it makes the algebra easier even if it does in practice mean that you would probably get numerical underflow if you didn't use logs)
As far as I can tell from your code, you start with prior probabilities p(Ham) and p(Spam), and then use probabilities estimated from previous data to work out p(Ham) * p(Observed data | Ham) and p(Spam) * p(Observed data | Spam).
Bayes' theorem rearranges p(Obs|Spam) = p(Obs & Spam) / p(Spam) = p(Obs) p(Spam|Obs) / p(Spam) to give you p(Spam|Obs) = p(Spam) p(Obs|Spam) / p(Obs), and you seem to have calculated p(Spam) p(Obs|Spam) = p(Obs & Spam) but not divided by p(Obs). Since there are only two possibilities, Ham and Spam, the easiest thing to do is probably to note that p(Obs) = p(Obs & Spam) + p(Obs & Ham), and so just divide each of your two calculated values by their sum, essentially scaling the values so that they do indeed sum to 1.0.
This scaling is trickier if you start off with log probabilities lA and lB. To scale these I would first bring them into range by subtracting the larger of the two from both (saving the maximum first, so that the second subtraction does not use the already-shifted value):

m = max(lA, lB)
lA = lA - m
lB = lB - m
Now at least the larger of the two won't overflow. The smaller still might, but I'd rather deal with underflow than overflow. Now turn them into not quite scaled probabilities:
pA = exp(lA)
pB = exp(lB)
and scale properly so they add to one:
truePA = pA / (pA + pB)
truePB = pB / (pA + pB)
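Putting this together in Python (a minimal sketch; the inputs stand for the two log scores your classify method computes):

import math

def normalize_log_scores(lA, lB):
    """Convert two log scores into probabilities that sum to 1."""
    m = max(lA, lB)
    # subtract the shared maximum so exp() cannot overflow;
    # the smaller value may underflow to 0.0, which is acceptable here
    pA = math.exp(lA - m)
    pB = math.exp(lB - m)
    total = pA + pB
    return pA / total, pB / total

# e.g. with the scores from the question:
p_ham, p_spam = normalize_log_scores(-2634.53, -1707.98)
# p_spam comes out as ~1.0, since the spam log score is far larger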

Python nltk classify with large feature set (Replicate Go Et Al 2009)

I'm trying to replicate Go et al.'s Twitter sentiment analysis, which can be found here: http://help.sentiment140.com/for-students
The problem I'm having is that the number of features is 364,464. I'm currently using nltk and nltk.NaiveBayesClassifier to do this, where tweets holds a replication of the 1,600,000 tweets and their polarity:
for tweet in tweets:
    tweet[0] = extract_features(tweet[0], features)

classifier = nltk.NaiveBayesClassifier.train(training_set)
# print "NB Classified"
classifier.show_most_informative_features()
print(nltk.classify.util.accuracy(classifier, testdata))
Nothing takes very long apart from the extract_features function:
def extract_features(tweet, featureList):
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features
This is because for each tweet it's creating a dictionary of size 364,464 to represent whether something is present or not.
Is there a way to make this faster or more efficient without reducing the number of features like in this paper?
Turns out there is a wonderful function called:
nltk.classify.util.apply_features()
which you can find here: http://www.nltk.org/api/nltk.classify.html
training_set = nltk.classify.apply_features(extract_features, tweets)
I had to change my extract_features function but it now works with the huge sizes without memory issues.
Here's a lowdown of the function description:
The primary purpose of this function is to avoid the memory overhead involved in storing all the featuresets for every token in a corpus. Instead, these featuresets are constructed lazily, as-needed. The reduction in memory overhead can be especially significant when the underlying list of tokens is itself lazy (as is the case with many corpus readers).
and my changed function:
def extract_features(tweet):
    tweet_words = set(tweet)
    global featureList
    features = {}
    for word in featureList:
        features[word] = False
    for word in tweet_words:
        if word in featureList:
            features[word] = True
    return features
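One possible further speedup, assuming featureList is a plain Python list: word in featureList is a linear scan over 364,464 entries for every tweet word, so converting it to a set makes each membership test O(1). A minimal sketch:

featureSet = set(featureList)

def extract_features(tweet):
    tweet_words = set(tweet)
    # start with every feature marked absent...
    features = dict.fromkeys(featureSet, False)
    # ...then flip only the features that actually occur in the tweet
    for word in tweet_words & featureSet:
        features[word] = True
    return features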
