I'm analyzing a book of 1167 pages (a txt file). So far I've done preprocessing of the data (cleaning, removing punctuation, removing stop words, tokenization).
Now, how can I cluster the text and visualize the similarity (i.e. plot the clusters)?
For example -
text1 = "there is one unshakable conviction that people—whatever the degree of development of their understanding and whatever the form taken by the factors present in their individuality for engendering all kinds of ideal"
text2 = "That is why I now also, in setting forth on this venture quite new for me, namely authorship, begin by pronouncing this invocation"
In my task, I divided the whole book into chapters: text1 is chapter 1, text2 is chapter 2, and so on. Now I want to compare chapter 1 and chapter 2 like this.
Example code that I used for data preprocessing:
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(pages[1])
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
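A minimal sketch of one possible next step, assuming the cleaned chapters are collected in a list of strings called chapters (chapters[0] = text1, chapters[1] = text2, and so on): build a TF-IDF vector per chapter, cluster the vectors with KMeans, and project them to 2D with PCA so the clusters can be plotted.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# `chapters` is an assumed variable: one cleaned string per chapter
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(chapters)          # one TF-IDF row per chapter

# pairwise comparison, e.g. chapter 1 vs chapter 2
print(cosine_similarity(X[0], X[1]))

# cluster the chapters (the number of clusters is an arbitrary starting point)
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)

# reduce to 2D with PCA so each chapter becomes a plottable point
coords = PCA(n_components=2).fit_transform(X.toarray())
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
for i, (x, y) in enumerate(coords):
    plt.annotate(f"ch{i + 1}", (x, y))
plt.show()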
Here I'm trying to read the contents of, say, 'book1.txt'; I have to remove all the special characters and punctuation marks and word-tokenize the content using NLTK's word tokenizer.
Then lemmatize those tokens using WordNetLemmatizer
and write those tokens into a csv file one by one.
Here is the code I'm using, which obviously is not working, but I just need some suggestions on this, please.
import nltk
from nltk.stem import WordNetLemmatizer
import csv
from nltk.tokenize import word_tokenize

file_out = open('data.csv', 'w')
with open('book1.txt', 'r') as myfile:
    for s in myfile:
        words = nltk.word_tokenize(s)
        words = [word.lower() for word in words if word.isalpha()]
        for word in words:
            token = WordNetLemmatizer().lemmatize(words, 'v')
            filtered_sentence = [""]
            for n in words:
                if n not in token:
                    filtered_sentence.append("" + n)
                file_out.writelines(filtered_sentence + ["\n"])
There are some issues here, most notably with the last two for loops.
The way you are doing it writes the output as follows:
word1
word1word2
word1word2word3
word1word2word3word4
........etc
I'm guessing that is not the expected output. I'm assuming the expected output is:
word1
word2
word3
word4
........etc (without creating duplicates)
I applied the code below to a 3 paragraph Cat Ipsum file. Note that I changed some variable names due to my own naming conventions.
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from pprint import pprint

# read the text into a single string.
with open("book1.txt") as infile:
    text = ' '.join(infile.readlines())

words = word_tokenize(text)
words = [word.lower() for word in words if word.isalpha()]

# create the lemmatized word list
results = []
for word in words:
    # you were using words instead of word below
    token = WordNetLemmatizer().lemmatize(word, "v")
    # check if token not already in results.
    if token not in results:
        results.append(token)

# sort results, just because :)
results.sort()

# print and save the results
pprint(results)
print(len(results))

with open("nltk_data.csv", "w") as outfile:
    # add a newline after each token so they land one per line
    outfile.writelines(result + "\n" for result in results)
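If an actual CSV file is wanted (the question imports csv but never uses it), a possible variant using csv.writer, writing one token per row:
import csv

# assumes `results` is the lemmatized, de-duplicated word list built above
with open("nltk_data.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for token in results:
        writer.writerow([token])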
I'm building an NLP chat application using the Doc2Vec technique in Python with its gensim package. I have already done tokenizing and stemming. I want to remove the stop words (to test whether it works better) from both the training set and the question the user asks.
Here is my code.
import gensim
import nltk
from gensim import models
from gensim import utils
from gensim import corpora
from nltk.stem import PorterStemmer
ps = PorterStemmer()
sentence0 = models.doc2vec.LabeledSentence(words=[u'sampl',u'what',u'is'],tags=["SENT_0"])
sentence1 = models.doc2vec.LabeledSentence(words=[u'sampl',u'tell',u'me',u'about'],tags=["SENT_1"])
sentence2 = models.doc2vec.LabeledSentence(words=[u'elig',u'what',u'is',u'my'],tags=["SENT_2"])
sentence3 = models.doc2vec.LabeledSentence(words=[u'limit', u'what',u'is',u'my'],tags=["SENT_3"])
sentence4 = models.doc2vec.LabeledSentence(words=[u'claim',u'how',u'much',u'can',u'I'],tags=["SENT_4"])
sentence5 = models.doc2vec.LabeledSentence(words=[u'retir',u'i',u'am',u'how',u'much',u'can',u'elig',u'claim'],tags=["SENT_5"])
sentence6 = models.doc2vec.LabeledSentence(words=[u'resign',u'i',u'have',u'how',u'much',u'can',u'i',u'claim',u'elig'],tags=["SENT_6"])
sentence7 = models.doc2vec.LabeledSentence(words=[u'promot',u'what',u'is',u'my',u'elig',u'post',u'my'],tags=["SENT_7"])
sentence8 = models.doc2vec.LabeledSentence(words=[u'claim',u'can,',u'i',u'for'],tags=["SENT_8"])
sentence9 = models.doc2vec.LabeledSentence(words=[u'product',u'coverag',u'cover',u'what',u'all',u'are'],tags=["SENT_9"])
sentence10 = models.doc2vec.LabeledSentence(words=[u'hotel',u'coverag',u'cover',u'what',u'all',u'are'],tags=["SENT_10"])
sentence11 = models.doc2vec.LabeledSentence(words=[u'onlin',u'product',u'can',u'i',u'for',u'bought',u'through',u'claim',u'sampl'],tags=["SENT_11"])
sentence12 = models.doc2vec.LabeledSentence(words=[u'reimburs',u'guidelin',u'where',u'do',u'i',u'apply',u'form',u'sampl'],tags=["SENT_12"])
sentence13 = models.doc2vec.LabeledSentence(words=[u'reimburs',u'procedur',u'rule',u'and',u'regul',u'what',u'is',u'the',u'for'],tags=["SENT_13"])
sentence14 = models.doc2vec.LabeledSentence(words=[u'can',u'i',u'submit',u'expenditur',u'on',u'behalf',u'of',u'my',u'friend',u'and',u'famili',u'claim',u'and',u'reimburs'],tags=["SENT_14"])
sentence15 = models.doc2vec.LabeledSentence(words=[u'invoic',u'bills',u'procedur',u'can',u'i',u'submit',u'from',u'shopper stop',u'claim'],tags=["SENT_15"])
sentence16 = models.doc2vec.LabeledSentence(words=[u'invoic',u'bills',u'can',u'i',u'submit',u'from',u'pantaloon',u'claim'],tags=["SENT_16"])
sentence17 = models.doc2vec.LabeledSentence(words=[u'invoic',u'procedur',u'can',u'i',u'submit',u'invoic',u'from',u'spencer',u'claim'],tags=["SENT_17"])
# User asks a question.
document = input("Ask a question:")
tokenized_document = list(gensim.utils.tokenize(document, lowercase = True, deacc = True))
#print(type(tokenized_document))
stemmed_document = []
for w in tokenized_document:
    stemmed_document.append(ps.stem(w))
sentence19 = models.doc2vec.LabeledSentence(words= stemmed_document, tags=["SENT_19"])
# Building vocab.
sentences = [sentence0,sentence1,sentence2,sentence3, sentence4, sentence5,sentence6, sentence7, sentence8, sentence9, sentence10, sentence11, sentence12, sentence13, sentence14, sentence15, sentence16, sentence17, sentence19]
#I tried to remove the stop words but it didn't work out as LabeledSentence object has no attribute lower.
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in sentences]
..
Is there a way I can remove stop words from sentences directly and get a new set of vocab without stop words?
Your sentences object is already a list of LabeledSentence objects. You construct these above; they include a list-of-strings in words and a list-of-strings in tags.
So each item in that list (document in your list comprehension) can't have a string method like .lower() applied to it. (Nor would it need .split(), as its words are already separate tokens.)
The cleanest approach would be to remove stop words from the lists-of-words before they're used to construct the LabeledSentence objects. For example, you could write a function remove_stopwords(), defined near the top. Then your lines creating LabeledSentence objects could instead look like:
sentence0 = LabeledSentence(words=remove_stopwords([u'sampl', u'what', u'is']),
                            tags=["SENT_0"])
Alternatively, you could mutate the existing LabeledSentence objects so that each of their words attributes now lack stop-words. This would replace your last line with something more like:
for doc in sentences:
    doc.words = [word for word in doc.words if word not in stoplist]
texts = sentences
Separately, things you didn't ask but should know:
TaggedDocument is now the preferred example class for Doc2Vec text objects (a minimal example follows these notes), but in fact any object that has the two required properties, words and tags, will work fine.
Doc2Vec doesn't show many of the desired properties on tiny, toy-sized datasets – don't be surprised if a model built on dozens of sentences does not do anything useful, or misleads about what preprocessing/meta-parameter options are best. (Tens of thousands of texts, and texts at least tens-of-words long, are much better for meaningful results.)
Much Word2Vec/Doc2Vec work doesn't bother with stemming or stop-word removal, but it may sometimes be helpful.
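For reference, a minimal sketch of the TaggedDocument form mentioned above, equivalent to one of your LabeledSentence lines:
from gensim.models.doc2vec import TaggedDocument

# same words and tag as sentence0, using the preferred class
sentence0 = TaggedDocument(words=[u'sampl', u'what', u'is'], tags=["SENT_0"])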
I have a list of processed text files that looks somewhat like this:
text = "this is the first text document. this is the second text document. this is the third document."
I've been able to successfully tokenize the sentences:
sentences = sent_tokenize(text)
for ii, sentence in enumerate(sentences):
    sentences[ii] = remove_punctuation(sentence)
sentence_tokens = [word_tokenize(sentence) for sentence in sentences]
And now I would like to remove stopwords from this list of tokens. However, because it's a list of sentences within a list of text documents, I can't seem to figure out how to do this.
This is what I've tried so far, but it returns no results:
sentence_tokens_no_stopwords = [w for w in sentence_tokens if w not in stopwords]
I'm assuming achieving this will require some sort of for loop, but what I have now isn't working. Any help would be appreciated!
You can use a nested list comprehension, like this:
sentence_tokens_no_stopwords = [[w for w in s if w not in stopwords] for s in sentence_tokens]
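A small self-contained example of the same idea (it assumes NLTK's English stop-word list; the variable names mirror the question):
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

stop_words = set(stopwords.words('english'))

text = "this is the first text document. this is the second text document."
sentence_tokens = [word_tokenize(s) for s in sent_tokenize(text)]

# filter each sentence's token list separately, keeping the nested structure
sentence_tokens_no_stopwords = [[w for w in s if w not in stop_words]
                                for s in sentence_tokens]
print(sentence_tokens_no_stopwords)
# e.g. [['first', 'text', 'document', '.'], ['second', 'text', 'document', '.']]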
I am trying to filter out stopwords in my text like so:
clean = ' '.join([word for word in text.split() if word not in (stopwords)])
The problem is that text.split() has elements like 'word.' that don't match to the stopword 'word'.
I later use clean in sent_tokenize(clean), however, so I don't want to get rid of the punctuation altogether.
How do I filter out stopwords while retaining punctuation, but filtering words like 'word.'?
I thought it would be possible to change the punctuation:
text = text.replace('.',' . ')
and then
clean = ' '.join([word for word in text.split() if word not in (stopwords)] or word == ".")
But is there a better way?
Tokenize the text first, then clean it of stopwords. A tokenizer usually recognizes punctuation as separate tokens.
import nltk

text = 'Son, if you really want something in this life, ' \
       'you have to work for it. Now quiet! They are about ' \
       'to announce the lottery numbers.'
stopwords = ['in', 'to', 'for', 'the']

sents = []
for sent in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(sent)
    sents.append(' '.join([w for w in tokens if w not in stopwords]))

print(sents)
['Son , if you really want something this life , you have work it .', 'Now quiet !', 'They are about announce lottery numbers .']
You could use something like this:
import re
clean = ' '.join([word for word in text.split()
                  if re.sub('[^a-zA-Z]', '', word).lower() not in stopwords])
This strips everything except lowercase and uppercase ASCII letters out of each word and checks what remains against the words in your stopwords set or list (punctuation-only tokens reduce to an empty string, so they are kept). It also assumes that all of your words in stopwords are lowercase, which is why I converted the word to lowercase; take that out if I made too great an assumption.
Also, I'm not proficient in regex, so sorry if there's a cleaner or more robust way of doing this.