I'm building an NLP chat application in Python using the Doc2Vec technique from the gensim package. I have already done the tokenizing and stemming. I want to remove stop words (to test whether that works better) from both the training set and the question the user asks.
Here is my code.
import gensim
import nltk
from gensim import models
from gensim import utils
from gensim import corpora
from nltk.stem import PorterStemmer
ps = PorterStemmer()
sentence0 = models.doc2vec.LabeledSentence(words=[u'sampl',u'what',u'is'],tags=["SENT_0"])
sentence1 = models.doc2vec.LabeledSentence(words=[u'sampl',u'tell',u'me',u'about'],tags=["SENT_1"])
sentence2 = models.doc2vec.LabeledSentence(words=[u'elig',u'what',u'is',u'my'],tags=["SENT_2"])
sentence3 = models.doc2vec.LabeledSentence(words=[u'limit', u'what',u'is',u'my'],tags=["SENT_3"])
sentence4 = models.doc2vec.LabeledSentence(words=[u'claim',u'how',u'much',u'can',u'I'],tags=["SENT_4"])
sentence5 = models.doc2vec.LabeledSentence(words=[u'retir',u'i',u'am',u'how',u'much',u'can',u'elig',u'claim'],tags=["SENT_5"])
sentence6 = models.doc2vec.LabeledSentence(words=[u'resign',u'i',u'have',u'how',u'much',u'can',u'i',u'claim',u'elig'],tags=["SENT_6"])
sentence7 = models.doc2vec.LabeledSentence(words=[u'promot',u'what',u'is',u'my',u'elig',u'post',u'my'],tags=["SENT_7"])
sentence8 = models.doc2vec.LabeledSentence(words=[u'claim',u'can,',u'i',u'for'],tags=["SENT_8"])
sentence9 = models.doc2vec.LabeledSentence(words=[u'product',u'coverag',u'cover',u'what',u'all',u'are'],tags=["SENT_9"])
sentence10 = models.doc2vec.LabeledSentence(words=[u'hotel',u'coverag',u'cover',u'what',u'all',u'are'],tags=["SENT_10"])
sentence11 = models.doc2vec.LabeledSentence(words=[u'onlin',u'product',u'can',u'i',u'for',u'bought',u'through',u'claim',u'sampl'],tags=["SENT_11"])
sentence12 = models.doc2vec.LabeledSentence(words=[u'reimburs',u'guidelin',u'where',u'do',u'i',u'apply',u'form',u'sampl'],tags=["SENT_12"])
sentence13 = models.doc2vec.LabeledSentence(words=[u'reimburs',u'procedur',u'rule',u'and',u'regul',u'what',u'is',u'the',u'for'],tags=["SENT_13"])
sentence14 = models.doc2vec.LabeledSentence(words=[u'can',u'i',u'submit',u'expenditur',u'on',u'behalf',u'of',u'my',u'friend',u'and',u'famili',u'claim',u'and',u'reimburs'],tags=["SENT_14"])
sentence15 = models.doc2vec.LabeledSentence(words=[u'invoic',u'bills',u'procedur',u'can',u'i',u'submit',u'from',u'shopper stop',u'claim'],tags=["SENT_15"])
sentence16 = models.doc2vec.LabeledSentence(words=[u'invoic',u'bills',u'can',u'i',u'submit',u'from',u'pantaloon',u'claim'],tags=["SENT_16"])
sentence17 = models.doc2vec.LabeledSentence(words=[u'invoic',u'procedur',u'can',u'i',u'submit',u'invoic',u'from',u'spencer',u'claim'],tags=["SENT_17"])
# User asks a question.
document = input("Ask a question:")
tokenized_document = list(gensim.utils.tokenize(document, lowercase = True, deacc = True))
stemmed_document = []
for w in tokenized_document:
    stemmed_document.append(ps.stem(w))
sentence19 = models.doc2vec.LabeledSentence(words= stemmed_document, tags=["SENT_19"])
# Building vocab.
sentences = [sentence0,sentence1,sentence2,sentence3, sentence4, sentence5,sentence6, sentence7, sentence8, sentence9, sentence10, sentence11, sentence12, sentence13, sentence14, sentence15, sentence16, sentence17, sentence19]
# I tried to remove the stop words, but it failed because a LabeledSentence object has no attribute 'lower'.
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in sentences]
Is there a way to remove stop words from sentences directly and get a new vocabulary without stop words?
Your sentences object is already a list of LabeledSentence objects. You construct these above; each one holds a list-of-strings in words and a list-of-strings in tags.
So each item in that list (document in your list comprehension) is not a string, and a string method like .lower() can't be applied to it. (Nor does it need .split(), as its words are already separate tokens.)
The cleanest approach would be to remove stop-words from the lists-of-words before they're used to construct LabeledSentence objects. For example, you could define a function remove_stopwords() at the top of your script. Then your lines creating LabeledSentence objects could instead look like:
sentence0 = LabeledSentence(words=remove_stopwords([u'sampl', u'what', u'is']),
                            tags=["SENT_0"])
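The helper itself could be as small as the following sketch. The stop list is the one from your question; the function name, its signature, and the sample question tokens are only illustrative and are not something gensim provides.

```python
# Hypothetical helper; the stop list is the small one from the question.
STOPLIST = frozenset('for a of the and to in'.split())

def remove_stopwords(words, stoplist=STOPLIST):
    """Return a new list of tokens with stop words dropped, order preserved."""
    return [word for word in words if word not in stoplist]

# The training tokens and the user's (already stemmed) question
# go through the same filter, so both sides see the same vocabulary.
training_words = remove_stopwords(
    [u'reimburs', u'procedur', u'rule', u'and', u'regul',
     u'what', u'is', u'the', u'for'])
question_words = remove_stopwords(
    [u'what', u'is', u'the', u'limit', u'for', u'claim'])
```

Since your tokens are stemmed before filtering, the entries in the stop list need to match the stemmed forms. That holds for this short list, but it is worth checking if you switch to a larger one.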
Alternatively, you could mutate the existing LabeledSentence objects so that each one's words attribute no longer contains stop-words. This would replace your last line with something more like:
for doc in sentences:
    doc.words = [word for word in doc.words if word not in stoplist]
texts = sentences
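One caveat, hedged because it depends on your gensim version: in the versions I know of, the document class is a namedtuple, so assigning to doc.words may raise an AttributeError. If you hit that, a sketch like the following builds new documents instead of mutating the old ones. To keep it runnable without gensim, Doc below is a stand-in namedtuple with the same two fields (words and tags); it is an assumption for illustration, not gensim's own class.

```python
from collections import namedtuple

# Stand-in for gensim's document class (assumption: same two fields).
Doc = namedtuple('Doc', 'words tags')

stoplist = set('for a of the and to in'.split())

sentences = [
    Doc(words=[u'claim', u'can', u'i', u'for'], tags=["SENT_8"]),
    Doc(words=[u'reimburs', u'procedur', u'what', u'is', u'the', u'for'],
        tags=["SENT_13"]),
]

# Build new documents with filtered words; tags are carried over unchanged.
texts = [
    doc._replace(words=[word for word in doc.words if word not in stoplist])
    for doc in sentences
]
```

The original sentences list is left untouched, which also makes it easy to train one model with stop words and one without and compare them.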
Separately, things you didn't ask but should know:
TaggedDocument is now the preferred example-class name for Doc2Vec text objects – but in fact any object that has the two required properties words and tags will work fine.
Doc2Vec doesn't show many of the desired properties on tiny, toy-sized datasets – don't be surprised if a model built on dozens of sentences does not do anything useful, or misleads about what preprocessing/meta-parameter options are best. (Tens of thousands of texts, and texts at least tens-of-words long, are much better for meaningful results.)
Much Word2Vec/Doc2Vec work doesn't bother with stemming or stop-word removal, but it may sometimes be helpful.
