Using data set for training and testing in NLTK [duplicate] - python

The following code runs a Naive Bayes movie review classifier and generates a list of the most informative features.
Note: the **movie_reviews** corpus is included in NLTK.
import string
from itertools import chain
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
stop = stopwords.words('english')
documents = [([w for w in movie_reviews.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in movie_reviews.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
Link to the code from alvas.
How can I test the classifier on a specific file?
Please let me know if my question is ambiguous or wrong.

First, read these answers carefully; they contain parts of the answer you need and also briefly explain what the classifier does and how it works in NLTK:
nltk NaiveBayesClassifier training for sentiment analysis
Using my own corpus instead of movie_reviews corpus for Classification in NLTK
http://www.nltk.org/book/ch06.html
Testing classifier on annotated data
Now to answer your question. We assume that your question is a follow-up to this question: Using my own corpus instead of movie_reviews corpus for Classification in NLTK
If your test text is structured the same way as the movie_review corpus, then you can simply read the test data as you would for the training data:
Just in case the explanation of the code is unclear, here's a walkthrough:
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
The two lines above read a directory my_movie_reviews with the following structure:
\my_movie_reviews
    \pos
        123.txt
        234.txt
    \neg
        456.txt
        789.txt
    README
Then the next line extracts each document together with its pos/neg tag, which is part of the directory path.
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
Here's the explanation for the above line:
# This extracts the pos/neg tag
labels = [i.split('/')[0] for i in mr.fileids()]
# Reads the words from the corpus through the CategorizedPlaintextCorpusReader object
words = [w for w in mr.words(i)]
# Removes the stopwords
words = [w for w in mr.words(i) if w.lower() not in stop]
# Removes the punctuation
words = [w for w in mr.words(i) if w not in string.punctuation]
# Removes the stopwords and punctuations
words = [w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation]
# Removes the stopwords and punctuations and put them in a tuple with the pos/neg labels
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
The SAME process should be applied when you read the test data!!!
Now to the feature processing:
The following lines extract the top 100 features for the classifier:
# Extract the words features and put them into FreqDist
# object which records the no. of times each unique word occurs
word_features = FreqDist(chain(*[i for i,j in documents]))
# Cuts the FreqDist to the top 100 words in terms of their counts.
word_features = word_features.keys()[:100]
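Note that word_features.keys()[:100] relies on older NLTK/Python 2 behaviour; with Python 3 and NLTK 3 the keys() view cannot be sliced and is not ordered by frequency. A small sketch that matches the stated intent (the 100 most frequent words), assuming the same FreqDist as above:
# FreqDist inherits from collections.Counter, so most_common(100)
# returns the 100 highest-count (word, count) pairs.
word_fd = FreqDist(chain(*[i for i, j in documents]))
word_features = [word for word, count in word_fd.most_common(100)]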
Next, process the documents into a classifiable format:
# Splits the training data into training size and testing size
numtrain = int(len(documents) * 90 / 100)
# Process the documents for training data
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
# Process the documents for testing data
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
Now to explain the long list comprehension for train_set and test_set:
# Take the first `numtrain` no. of documents
# as training documents
train_docs = documents[:numtrain]
# Takes the rest of the documents as test documents.
test_docs = documents[numtrain:]
# These extract the feature sets for the classifier
# please look at the full explanation on https://stackoverflow.com/questions/20827741/nltk-naivebayesclassifier-training-for-sentiment-analysis/
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in train_docs]
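To make the feature dictionaries concrete, here is a tiny hypothetical example (the feature words, tokens and tag are made up) of what a single entry of train_set looks like:
# Hypothetical: three feature words and one tokenized document tagged 'pos'.
example_features = ['good', 'bad', 'plot']
tokens, tag = ['good', 'plot', 'twist'], 'pos'
featureset = {i: (i in tokens) for i in example_features}
print(featureset, tag)
# -> {'good': True, 'bad': False, 'plot': True} pos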
You need to process the test documents in the same way to extract their features too!!!
So here's how you can read the test data:
stop = stopwords.words('english')
# Reads the training data.
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
# Converts training data into tuples of [(words,label), ...]
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
# Now do the same for the testing data.
testdir = '/home/alvas/test_reviews'
mr_test = CategorizedPlaintextCorpusReader(testdir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
# Converts testing data into tuples of [(words,label), ...]
test_documents = [([w for w in mr_test.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr_test.fileids()]
Then continue with the processing steps described above, and simply do this to get the label for the test document as #yvespeirsman answered:
#### FOR TRAINING DATA ####
stop = stopwords.words('english')
# Reads the training data.
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
# Converts training data into tuples of [(words,label), ...]
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
# Extract training features.
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
# Assuming that you're using full data set
# since your test set is different.
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
#### TRAINS THE TAGGER ####
# Train the tagger
classifier = NaiveBayesClassifier.train(train_set)
#### FOR TESTING DATA ####
# Now do the same reading and processing for the testing data.
testdir = '/home/alvas/test_reviews'
mr_test = CategorizedPlaintextCorpusReader(testdir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
# Converts testing data into tuples of [(words,label), ...]
test_documents = [([w for w in mr_test.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr_test.fileids()]
# Reads test data into features:
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in test_documents]
#### Evaluate the classifier ####
for doc, gold_label in test_set:
    tagged_label = classifier.classify(doc)
    if tagged_label == gold_label:
        print("Woohoo, correct")
    else:
        print("Boohoo, wrong")
If the above code and explanation make no sense to you, then you MUST read this tutorial before proceeding: http://www.nltk.org/howto/classify.html
Now let's say you have no annotation in your test data, i.e. your test files are not organized into pos/neg subdirectories like movie_reviews; they are just plain text files:
\test_movie_reviews
    \1.txt
    \2.txt
Then there's no point reading it into a categorized corpus; you can simply read and tag the documents, i.e.:
for infile in os.listdir('test_movie_reviews'):  # assumes `import os`
    for line in open(os.path.join('test_movie_reviews', infile), 'r'):
        doc = word_tokenize(line.lower())  # tokenize before featurizing
        featurized_doc = {i: (i in doc) for i in word_features}
        tagged_label = classifier.classify(featurized_doc)
BUT you CANNOT evaluate the results without annotation, so there is no gold label to compare the predicted tag against. Also note that you need to tokenize the text yourself (as above) whenever you're not using the CategorizedPlaintextCorpusReader.
If you just want to tag a plaintext file test.txt:
import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk import word_tokenize
stop = stopwords.words('english')
# Extracts the documents.
documents = [([w for w in movie_reviews.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in movie_reviews.fileids()]
# Extract the features.
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
# Converts documents to features.
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
# Train the classifier.
classifier = NaiveBayesClassifier.train(train_set)
# Tag the test file.
with open('test.txt', 'r') as fin:
    for test_sentence in fin:
        # Tokenize the line.
        doc = word_tokenize(test_sentence.lower())
        featurized_doc = {i:(i in doc) for i in word_features}
        tagged_label = classifier.classify(featurized_doc)
        print(tagged_label)
Once again, please don't just copy and paste the solution; try to understand why and how it works.

You can test on one file with classifier.classify(). This method takes as its input a dictionary with the features as its keys, and True or False as their values, depending on whether the feature occurs in the document or not. It outputs the most probable label for the file, according to the classifier. You can then compare this label with the correct label for the file to see if the classification is correct.
In your training and test sets, the feature dictionaries are always the first item in each tuple, and the labels are the second item.
Thus, you can classify the first document in the test set like so:
(my_document, my_label) = test_set[0]
if classifier.classify(my_document) == my_label:
    print "correct!"
else:
    print "incorrect!"

Related

Corpus analysis with python

I'm a new student of natural language processing and I have a task regarding simple corpus analysis. Given an input file (MovieCorpus.txt) we are assigned to compute the following statistics:
Number of sentences, tokens, types (lemmas)
Distribution of sentence length, types, POS
import nltk
import spacy as sp
from nltk import word_tokenize
# Set up the spaCy model
nlp = sp.load('en_core_web_sm')
# Movie Corpus
with open('MovieCorpus.txt', 'r') as f:
    read_data = f.read().splitlines()
# Tokenize, POS, Lemma
tokens = []
lemma = []
pos = []
for doc in nlp.pipe(read_data):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        tokens.append(None)
        lemma.append(None)
        pos.append(None)
ls = len(read_data)
print("The amount of sentences is %d:" %ls)
lt = len(tokens)
print("The amount of tokens is %d:" %lt)
ll = len(lemma)
print("The amount of lemmas is %d:" %ll)
This is my attempt at answering those questions, but since the file is very large (>300,000 sentences) it takes forever to analyze. Is there anything I did wrong? Should I rather use NLTK instead of spaCy?
import pandas as pd
import nltk
from nltk import word_tokenize
# Movie Corpus
with open('MovieCorpus.txt', 'r') as f:
    read_data = f.read().splitlines()
data = pd.DataFrame({"text": read_data})  # Assuming your data has no header
data = data.head(10)
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
data['lemma'] = data.text.apply(lemmatize_text)
data["tokens"] = data.text.apply(nltk.word_tokenize)
data["posR"] = data.tokens.apply(lambda x: nltk.pos_tag(x))
tags = [[tag for word, tag in _] for _ in data["posR"].to_list()]
data["pos"] = tags
print(data)
From here on you should be able to do all other tasks by yourself.
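For example, a minimal sketch (assuming the data DataFrame built above, with its tokens, lemma and pos columns) of the requested counts and distributions might look like:
from collections import Counter
# Number of sentences is just the number of rows (one sentence per line).
num_sentences = len(data)
# Flatten the token and lemma columns to count tokens and types.
all_tokens = [t for row in data["tokens"] for t in row]
all_lemmas = [l for row in data["lemma"] for l in row]
num_tokens = len(all_tokens)
num_types = len(set(all_lemmas))
# Distributions of sentence length and POS tags.
sentence_lengths = data["tokens"].apply(len)
pos_counts = Counter(tag for row in data["pos"] for tag in row)
print(num_sentences, num_tokens, num_types)
print(sentence_lengths.describe())
print(pos_counts.most_common(10))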

Multilabel text classification/clustering in python with tf-idf

I have five plain text documents in a directory that are already clustered based on their content and named accordingly: cluster1.txt, cluster2.txt and so on, so they are functioning as my corpus. Otherwise they don't have any labels, they are just named as such.
My task is to cluster a new text document with new sentences, but not the document as a whole; instead I should cluster each sentence into one of these 5 clusters or classes, and also produce a confusion matrix with recall and precision scores to show how similar the sentences are to the clusters.
I first tried to do it with kNN and then with k-means, but I think my logic is flawed, since this is not a clustering problem but a classification problem, right?
Well, at least I tried to preprocess the text (removing stop words, lemmatizing, lowercasing, tokenizing) and then I calculated the term frequency with a CountVectorizer and then the tf-idf.
I'm having trouble with the logic of this problem.
Anyway, this is what I tried so far, but now I'm stuck. Any help is appreciated.
import glob
import os
import re
import nltk
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

file_list = glob.glob(os.path.join(os.getcwd(), 'C:/Users/ds191033/FH/Praktikum/Testdaten/Clusters', "*.txt"))
corpus = []
for file_path in file_list:
    with open(file_path, encoding="utf8") as f_input:
        corpus.append(f_input.read())
stopwords = nltk.corpus.stopwords.words('german')
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('german')
lem = WordNetLemmatizer()
def normalize_document(doc):
    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, re.I|re.A)
    doc = doc.lower()
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # lemmatize
    lemmatized_tokens = [lem.lemmatize(t) for t in filtered_tokens]
    # re-create document from filtered tokens
    doc = ' '.join(lemmatized_tokens)
    return doc
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(corpus)
norm_corpus
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix = cv_matrix.toarray()
cv_matrix
# get all unique words in the corpus
vocab = cv.get_feature_names()
# show document feature vectors
pd.DataFrame(cv_matrix, columns=vocab)
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
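The classification step itself is not shown above; for what it's worth, here is a minimal sketch of one way to finish it, assuming tv and tv_matrix from the code above and a hypothetical list new_sentences containing the sentences of the new document, preprocessed with the same normalize_document function:
from sklearn.metrics.pairwise import cosine_similarity
# Hypothetical input: sentences of the new document, normalized like the corpus.
new_sentences = [normalize_document("Ein Beispielsatz aus dem neuen Dokument.")]
# Reuse the fitted TfidfVectorizer so both sides share one vocabulary.
new_matrix = tv.transform(new_sentences)
# Cosine similarity of each new sentence against the 5 cluster documents.
sims = cosine_similarity(new_matrix, tv_matrix)
# Index of the most similar cluster, in the order the cluster files were read.
predicted_clusters = sims.argmax(axis=1)
print(predicted_clusters)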

sklearn TfidfVectorizer : How to make few words to only be part of bi gram in the features

I want the featurization of TfidfVectorizer to consider some predefined words, such as "script" and "rule", to be used only in bigrams.
If I have text "Script include is a script that has rule which has a business rule"
for the above text if I use
tfidf = TfidfVectorizer(ngram_range=(1,2),stop_words='english')
I should get
['script include','business rule','include','business']
from sklearn.feature_extraction import text
# Given a vocabulary returns a filtered vocab which
# contain only tokens in include_list and which are
# not stop words
def filter_vocab(full_vocab, include_list):
    b_list = list()
    for x in full_vocab:
        add = False
        for t in x.split():
            if t in text.ENGLISH_STOP_WORDS:
                add = False
                break
            if t in include_list:
                add = True
        if add:
            b_list.append(x)
    return b_list
# Get all the ngrams (one can also use nltk.util.ngrams)
ngrams = TfidfVectorizer(ngram_range=(1,2), norm=None, smooth_idf=False, use_idf=False)
X = ngrams.fit_transform(["Script include is a script that has rule which has a business rule"])
full_vocab = ngrams.get_feature_names()
# filter the full ngram based vocab
filtered_v = filter_vocab(full_vocab,["include", "business"])
# Get tfidf using the new filtered vocab
vectorizer = TfidfVectorizer(ngram_range=(1,2), vocabulary=filtered_v)
X = vectorizer.fit_transform(["Script include is a script that has rule which has a business rule"])
v = vectorizer.get_feature_names()
print (v)
Code is commented to explain what it is doing
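If I am reading filter_vocab correctly, the final print(v) should give something like ['business', 'business rule', 'include', 'script include']: only the unigrams and bigrams that contain one of the special words and no English stop words survive the filtering.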
Basically you want to customize the n-gram creation based on your special words (I call them interested_words in the function below). I have customized the default n-gram creation function for this purpose.
def custom_word_ngrams(tokens, stop_words=None, interested_words=None):
    """Turn tokens into a sequence of n-grams after stop words filtering"""
    original_tokens = tokens
    stop_wrds_inds = np.where(np.isin(tokens, stop_words))[0]
    intersted_wrds_inds = np.where(np.isin(tokens, interested_words))[0]
    tokens = [w for w in tokens if w not in stop_words + interested_words]
    n_original_tokens = len(original_tokens)
    # bind method outside of loop to reduce overhead
    tokens_append = tokens.append
    space_join = " ".join
    for i in xrange(n_original_tokens - 1):
        if not any(np.isin(stop_wrds_inds, [i, i+1])):
            tokens_append(space_join(original_tokens[i: i + 2]))
    return tokens
Now, we can plugin this function inside the usual analyzer of TfidfVectorizer, as following!
import numpy as np
from sklearn.externals.six.moves import xrange
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.feature_extraction import text
def analyzer():
    base_vect = CountVectorizer()
    stop_words = list(text.ENGLISH_STOP_WORDS)
    preprocess = base_vect.build_preprocessor()
    tokenize = base_vect.build_tokenizer()
    return lambda doc: custom_word_ngrams(
        tokenize(preprocess(base_vect.decode(doc))),
        stop_words, ['script', 'rule'])  # feed your special words list here
vectorizer = TfidfVectorizer(analyzer=analyzer())
vectorizer.fit(["Script include is a script that has rule which has a business rule"])
vectorizer.get_feature_names()
['business', 'business rule', 'include', 'script include']
TfidfVectorizer allows you to provide your own tokenizer, so you can do something like the code below. But you will lose the information about the other words in the vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["Script include is a script that has rule which has a business rule"]
vectorizer = TfidfVectorizer(ngram_range=(1,2),tokenizer=lambda corpus: [ "script", "rule"],stop_words='english')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
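If I am not mistaken, this prints something like ['rule', 'script', 'script rule']: since the custom tokenizer returns only the special words, every other word in the sentence is dropped from the vocabulary, which is the trade-off mentioned above.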

computing cosine-similarity between all texts in a corpus

I have a set of documents stored in a JSON file. I retrieve them using the following code so that they are stored under the name data:
import json
with open('SDM_2015.json') as f:
    data = [json.loads(line) for line in f]
Integrating all texts into a single one to form the corpus is done by:
corpus = []
for i in range(len(data) - 1):
    corpus.append(data[i]['body'] + data[i+1]['body'])
So far, pretty straightforward manipulations. To build the tf-idf I use the following lines of code, which remove stop words and punctuation, stem each term, and tokenize the data.
import nltk
import string
from sklearn.feature_extraction.text import TfidfVectorizer
# stemming each word (common root)
stemmer = nltk.stem.porter.PorterStemmer()
# removing punctuation etc.
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
## First function that creates the tokens
def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]
## Second function, incorporating the first, that converts all words to lower case and removes the punctuation mapped above
def normalize(text):
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))
## Lastly, a vectorizer that combines the previous functions plus stop word removal
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
I then try to apply this vectorizer to the corpus:
tfidf = vectorizer.fit_transform(corpus)
print(((tfidf*tfidf.T).A)[0,1])
But nothing happens. Any idea how to proceed?
Kind regards
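For reference, a minimal sketch (assuming tfidf = vectorizer.fit_transform(corpus) succeeded) that computes the full pairwise cosine-similarity matrix instead of a single entry:
from sklearn.metrics.pairwise import cosine_similarity
# TfidfVectorizer L2-normalizes rows by default, so tfidf * tfidf.T already
# holds cosine similarities; cosine_similarity just makes that explicit.
similarity_matrix = cosine_similarity(tfidf)
print(similarity_matrix.shape)   # (n_documents, n_documents)
print(similarity_matrix[0, 1])   # similarity between documents 0 and 1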

Train corpus of Tweets for Sentiment Analysis, using NLTK for Python

I'm trying to train my own corpus for sentiment analysis, using NLTK for Python. I have two text files: one has 25K positive tweets, one per line; the other has 25K negative tweets.
I use method 2 from this Stack Overflow article.
When I run this code to create corpora:
import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.corpus import CategorizedPlaintextCorpusReader
import nltk
mydir = 'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = word_features.keys()[:100]
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]
classifier = nbc.train(train_set)
print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
I receive this error message:
C:\Users\gerbuiker\Anaconda\python.exe "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py"
Traceback (most recent call last):
  File "C:/Users/gerbuiker/Desktop/Sentiment Analyse/CORPUS_POS_NEG/CreateCorpus.py", line 23, in <module>
    documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
  File "C:\Users\gerbuiker\AppData\Roaming\Python\Python27\site-packages\nltk\corpus\reader\util.py", line 336, in iterate_from
    assert self._len is not None
AssertionError
Process finished with exit code 1
Does anyone know how to fix this?
I'm not 100% positive, as I'm not on a Windows machine to test this at the moment, but I think what may be tripping you up is the difference in path-slash direction between #alvas' original example and your Windows adaptation.
Specifically, you use 'C:\Users\gerbuiker\Desktop\Sentiment Analyse\my_movie_reviews' while his example uses '/home/alvas/my_movie_reviews'. For the most part this is fine, but you also re-use his cat_pattern regex, r'(neg|pos)/.*', which matches the forward slash in his paths but rejects the backslash in yours.
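If that is indeed the cause, a hedged sketch of the adjustment (the paths are the ones from the question and may need tweaking): either use forward slashes, which Windows accepts, or loosen the cat_pattern to accept both separators.
from nltk.corpus import CategorizedPlaintextCorpusReader
# Option 1: forward slashes, so the original cat_pattern keeps matching.
mydir = 'C:/Users/gerbuiker/Desktop/Sentiment Analyse/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt',
                                      cat_pattern=r'(neg|pos)/.*', encoding='ascii')
# Option 2: keep the path as-is but accept either slash in the category pattern.
# mr = CategorizedPlaintextCorpusReader(mydir, r'(?!\.).*\.txt',
#                                       cat_pattern=r'(neg|pos)[/\\].*', encoding='ascii')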
