NLTK-based stemming and lemmatization - python

I am trying to preprocess a string using a lemmatizer and then remove the punctuation and digits. I am using the code below to do this. I am not getting any error, but the text is not preprocessed properly: only the stop words are removed, the lemmatizing does not work, and the punctuation and digits remain.
from nltk.stem import WordNetLemmatizer
import string
import nltk
tweets = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
lemmatizer = WordNetLemmatizer()
tweets = lemmatizer.lemmatize(tweets)
data=[]
stop_words = set(nltk.corpus.stopwords.words('english'))
words = nltk.word_tokenize(tweets)
words = [i for i in words if i not in stop_words]
data.append(' '.join(words))
corpus = " ".join(str(x) for x in data)
p = string.punctuation
d = string.digits
table = str.maketrans(p, len(p) * " ")
corpus.translate(table)
table = str.maketrans(d, len(d) * " ")
corpus.translate(table)
print(corpus)
The final output I get is:
This beautiful day16~ . I ; working exercise45.^^^45 text34 .
And expected output should look like:
This beautiful day I work exercise text

No, your current approach does not work, because the lemmatizer/stemmer expects one word at a time. If you pass it a whole sentence, it treats the entire string as a single word, finds no match, and returns it unchanged.
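import re
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
__stop_words = set(nltk.corpus.stopwords.words('english'))

def clean(tweet):
    # Lowercase, then strip punctuation and digits
    cleaned_tweet = re.sub(r'([^\w\s]|\d)+', '', tweet.lower())
    # Lemmatize each remaining word as a verb, skipping stop words
    return ' '.join([lemmatizer.lemmatize(i, 'v')
                     for i in cleaned_tweet.split() if i not in __stop_words])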
Alternatively, you can use a PorterStemmer, which serves a similar purpose to lemmatisation but strips suffixes by rule instead of looking words up in a dictionary, so the result is not always a real word.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
Then call the stemmer on each word like this:
stemmer.stem(i)
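A separate reason the punctuation and digits survive in the question's code is that str.translate() returns a new string and leaves the original untouched, so its result has to be assigned. A minimal sketch using only the standard library, applied to the output shown in the question (the variable name cleaned is just for illustration):

```python
import string

corpus = "This beautiful day16~ . I ; working exercise45.^^^45 text34 ."

# Map every punctuation character and digit to None (i.e. delete it)
table = str.maketrans('', '', string.punctuation + string.digits)

# translate() returns a new string, so the result has to be assigned
cleaned = corpus.translate(table)

# Collapse the runs of whitespace left behind
cleaned = ' '.join(cleaned.split())
print(cleaned)  # This beautiful day I working exercise text
```

Deleting the characters in a single table also avoids building two separate translation tables for punctuation and digits.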

I think this is what you're looking for, but do this prior to calling the lemmatizer as the commenter noted.
>>> import re
>>> s = "This is a beautiful day16~. I am; working on an exercise45.^^^45 text34."
>>> s = re.sub(r'[^A-Za-z ]', '', s)
>>> s
'This is a beautiful day I am working on an exercise text'

To process a tweet properly you can use the following code:
import re
import string
import nltk

def process(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """ Normalizes case and handles punctuation
    Inputs:
        text: str: raw text
        lemmatizer: an instance of a class implementing the lemmatize() method
            (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
    Outputs:
        list(str): tokenized text
    """
    bcd = []
    pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    text1 = text.lower()
    # Drop URLs
    text1 = re.sub(pattern, "", text1)
    # Drop possessives and apostrophes; turn long dashes into spaces
    text1 = text1.replace("'s ", " ")
    text1 = text1.replace("'", "")
    text1 = text1.replace("—", " ")
    # Replace every punctuation character with a space
    table = str.maketrans(string.punctuation, len(string.punctuation) * " ")
    text1 = text1.translate(table)
    geek = nltk.word_tokenize(text1)
    abc = nltk.pos_tag(geek)
    output = []
    # Convert Penn Treebank tags to WordNet POS letters
    for value in abc:
        value = list(value)
        if value[1][0] == "N":
            value[1] = 'n'
        elif value[1][0] == "V":
            value[1] = 'v'
        elif value[1][0] == "J":
            value[1] = 'a'
        elif value[1][0] == "R":
            value[1] = 'r'
        else:
            value[1] = 'n'
        output.append(value)
    abc = output
    for value in abc:
        bcd.append(lemmatizer.lemmatize(value[0], pos=value[1]))
    return bcd
Here I have used pos_tag and mapped only the N, V, J and R tags to their WordNet parts of speech; every other tag is treated as a noun. The function returns a tokenized and lemmatized list of words.
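The tag-conversion step in process() could also be written as a small lookup. This is only a sketch: penn_to_wordnet is a hypothetical helper name, not part of NLTK, and the tagged list is made-up sample data in the shape that nltk.pos_tag returns.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    mapping = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}
    # Anything that is not a noun, verb, adjective or adverb falls back to noun
    return mapping.get(tag[:1], 'n')

# Sample (word, tag) pairs in the format produced by nltk.pos_tag
tagged = [('dogs', 'NNS'), ('running', 'VBG'), ('quick', 'JJ'),
          ('quickly', 'RB'), ('the', 'DT')]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
```

Inside process(), each pair could then be passed on as lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)), which removes the need for the intermediate output list.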

Related

Expected str instance, spacy.tokens.token.Token found

I am executing a data extraction use-case. To preprocess and tokenize my data, I am using both spacy English and German tokenizers, because the sentences are in both the languages. Here's my code:
import spacy
from spacy.lang.de import German
from spacy.lang.en import English
from spacy.lang.de import STOP_WORDS as stp_wrds_de
from spacy.lang.en.stop_words import STOP_WORDS as stp_wrds_en
import string
punctuations = string.punctuation
# German Parser
parser_de = German()
# English Parser
parser_en = English()
def spacy_tokenizer_de(document):
    # Token object for splitting text into 'units'
    tokens = parser_de(document)
    # Lemmatization: Grammatical conversion of words
    tokens = [word.lemma_.strip() if word.lemma_ != '-PRON-' else word for word in tokens]
    # Remove punctuations
    tokens = [word for word in tokens if word not in punctuations]
    tokens_de_str = converttostr(tokens,' ')
    tokens_en = spacy_tokenizer_en(tokens_de_str)
    print("Tokens EN: {}".format(tokens_en))
    tokens_en = converttostr(tokens_en,' ')
    return tokens_en

def converttostr(input_seq, separator):
    # Join all the strings in list
    final_str = separator.join(input_seq)
    return final_str

def spacy_tokenizer_en(document):
    tokens = parser_en(document)
    tokens = [word.lemma_.strip() if word.lemma_ != '-PRON-' else word for word in tokens]
    return tokens
Here's a further elucidation of the above code:
1. spacy_tokenizer_de(): Method to parse and tokenize document in German
2. spacy_tokenizer_en(): Method to parse and tokenize document in English
3. converttostr(): Converts list of tokens to a string, so that the English spacy tokenizer can read the input (only accepts document/string format) and tokenize the data.
However, some sentences, when parsed, lead to the error quoted in the title: expected str instance, spacy.tokens.token.Token found.
Why is a spacy token object coming up in such scenarios, whereas, some of the sentences are being processed successfully? Can anyone please help here?
The list comprehension should yield a string in both branches:
[token.lemma_.strip() if token.lemma_ != '-PRON-' else token.text for token in tokens]
You're supposed to get a list of words here, right? Instead, you sometimes produce a string (when the lemma is not '-PRON-') and other times the token object itself, which is not a string, so the later join fails.
You can get a string from token.text.
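This also explains why only some sentences fail: only a token whose lemma is '-PRON-' reaches the else branch. The sketch below reproduces the effect without spaCy; FakeToken is a hypothetical stand-in that only mimics the .text and .lemma_ attributes, and the sample lemmas are made up for illustration.

```python
class FakeToken:
    """Hypothetical stand-in for a spaCy Token: has .text and .lemma_ but is not a str."""
    def __init__(self, text, lemma):
        self.text = text
        self.lemma_ = lemma

tokens = [FakeToken("I", "-PRON-"), FakeToken("was", "be"), FakeToken("running", "run")]

# Buggy version: the else branch yields the token object itself
mixed = [t.lemma_.strip() if t.lemma_ != '-PRON-' else t for t in tokens]
try:
    ' '.join(mixed)
    join_failed = False
except TypeError:
    # str.join() only accepts strings
    join_failed = True

# Fixed version: both branches yield a string
words = [t.lemma_.strip() if t.lemma_ != '-PRON-' else t.text for t in tokens]
joined = ' '.join(words)
```

A sentence without any '-PRON-' lemma never produces a non-string element, which is why it passes through converttostr() without complaint.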

I'm getting TypeError: expected string or bytes-like object

import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
paragraph = ''' State-run Bharat Sanchar Nigam Ltd (BSNL) is readying to pay November salary in another two days, which will be raised from internal accruals and bank loans.'''
sentence = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()
for i in range(len(sentence)):
    words = nltk.word_tokenize(i)
    words = [stemmer.stem(word) for word in words if word not in set(stopwords.words('english'))]
    sentence[i] = ' '.join(words)
I am getting an error on this part
words = nltk.word_tokenize(i)
range() produces an iterable of integers. So, when you feed i into nltk.word_tokenize(), you're feeding it an integer. Obviously, an integer is not string-like.
I don't personally know how nltk.word_tokenize() is supposed to work, but based on context clues it would seem you might want to pass the sentence object at the index i instead of just the index i:
words = nltk.word_tokenize(sentence[i])
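An equivalent way to write the loop is with enumerate(), which hands you the index and the sentence together. This is a sketch only: tokenize is a whitespace-splitting stand-in for nltk.word_tokenize, and the stop-word set and sentences are toy data for illustration.

```python
sentences = ["State-run BSNL is readying to pay salary.",
             "It will be raised from bank loans."]

# Toy stop-word set, built once rather than once per word
stop_words = {"is", "to", "it", "will", "be", "from"}

def tokenize(text):
    # Stand-in for nltk.word_tokenize: a plain whitespace split
    return text.split()

for i, sentence in enumerate(sentences):
    # sentence is the string itself; i is only used to write the result back
    words = [w for w in tokenize(sentence) if w.lower() not in stop_words]
    sentences[i] = ' '.join(words)
```

Building the stop-word set outside the loop also avoids recreating it for every word, which the original list comprehension does.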

How to add more stopwords in nltk list?

I have the following code. I need to add more words to the NLTK stopword list, but after I run this, the words are not added to the list.
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
new_words = open("stopwords_en.txt", "r")
new_stopwords = stop.union(new_word)
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in new_stopwords])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
doc_clean = [clean(doc).split() for doc in emails_body_text]
Don't do things blindly. Read in your new list of stopwords, inspect it to see that it's right, then add it to the other stopword list. Start with the code suggested by @greg_data, but you'll need to strip newlines and maybe do other things -- who knows what your stopwords file looks like?
This might do it, for example:
new_words = open("stopwords_en.txt", "r").read().split()
new_stopwords = stop.union(new_words)
PS. Don't keep splitting and joining your document; tokenize once and work with the list of tokens.
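Both points can be sketched with the standard library alone. In the example below, base_stop stands in for set(stopwords.words('english')) and the io.StringIO object stands in for the opened stopwords_en.txt; the file contents are invented for illustration.

```python
import io

base_stop = {"the", "a", "is"}  # stand-in for set(stopwords.words('english'))

# Stand-in for open("stopwords_en.txt"): one word per line, with stray whitespace
stopword_file = io.StringIO("please\n  regards \n\nthanks\n")

# split() with no arguments drops newlines, blank lines and surrounding spaces
new_words = stopword_file.read().split()
all_stop = base_stop.union(new_words)

def clean(doc):
    # Tokenize once, then filter the token list
    tokens = doc.lower().split()
    return [t for t in tokens if t not in all_stop]

result = clean("Thanks the report is ready please review")
```

Printing new_words before the union is a cheap way to inspect what was actually read from the file.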

Stemming words with NLTK (python)

I am new to Python text processing. I am trying to stem the words in a text document that has around 5000 rows.
I have written the script below:
from nltk.corpus import stopwords # Import the stop word list
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
def Description_to_words(raw_Description ):
    # 1. Remove HTML
    Description_text = BeautifulSoup(raw_Description).get_text()
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", Description_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    # 4. Remove stop words
    meaningful_words = [w for w in words if not w in stops]
    # 5. stem words
    words = ([stemmer.stem(w) for w in words])
    # 6. Join the words back into one string separated by space,
    # and return the result.
    return( " ".join( meaningful_words ))
clean_Description = Description_to_words(train["Description"][15])
But when I tested the results, the words were not stemmed. Can anyone help me find the issue? I am doing something wrong in the "Description_to_words" function.
And, when I execute stem command separately like below it works.
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> words = word_tokenize("MOBILE APP - Unable to add reading")
>>>
>>> for w in words:
...     print(stemmer.stem(w))
...
mobil
app
-
unabl
to
add
read
Here's each step of your function, fixed.
Remove HTML.
Description_text = BeautifulSoup(raw_Description).get_text()
Remove non-letters, but don't remove whitespace just yet. You can also simplify your regex a bit (note that \w also keeps digits and underscores).
letters_only = re.sub(r"[^\w\s]", " ", Description_text)
Convert to lower case, split into individual words: I recommend using word_tokenize again, here.
from nltk.tokenize import word_tokenize
words = word_tokenize(letters_only.lower())
Remove stop words.
stops = set(stopwords.words("english"))
meaningful_words = [w for w in words if not w in stops]
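Stem words. This is the actual bug: your function stems words but returns meaningful_words, so the stemmed list is thrown away. Stem meaningful_words, not words, and return the result.
return ' '.join(stemmer.stem(w) for w in meaningful_words)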
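Put together, the corrected order of operations looks roughly like the sketch below. It is deliberately independent of NLTK: toy_stem is a hypothetical stand-in for stemmer.stem, the stop-word set is passed in by the caller, and the HTML-stripping step is left out.

```python
import re

def toy_stem(word):
    # Hypothetical stand-in for stemmer.stem: strips a trailing "ing" or "s"
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def description_to_words(raw, stops, stem=toy_stem):
    # Keep letters only, lowercase, and split into words
    letters_only = re.sub(r"[^a-zA-Z]", " ", raw)
    words = letters_only.lower().split()
    # Filter first ...
    meaningful_words = [w for w in words if w not in stops]
    # ... then stem the filtered list and return that result
    return " ".join(stem(w) for w in meaningful_words)

result = description_to_words("MOBILE APP - Unable to add readings", {"to"})
```

With NLTK available, passing stem=SnowballStemmer('english').stem and stops=set(stopwords.words('english')) would plug the real components into the same structure.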

How can I print the subject of a text

I'm using this code to determine the subject and the time/location of a sentence.
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
import nltk
text = input('Please enter a sentence: ')
words = text.split()
sentence = pos_tag(words)
grammar = '''
Action: {<NN.*>?<VB.*><RB.*>?}
Location: {<IN><NN.*>+}
Subject: {<DT>?<JJ>*<NN.*>}
'''
cp = nltk.RegexpParser(grammar, "Input")
result = cp.parse(sentence)
result.draw()
How can I print just the subject of the sentence?
You can use the label() method of the Tree objects. Here I set up a loop to check each element of the result to see if it is an nltk.tree.Tree instance. Then, if the label is "Subject", it gets appended to the subject list.
subject = []
for word in result:
    print(word)
    if isinstance(word, nltk.tree.Tree):
        if word.label() == 'Subject':
            subject.append(word)
# Sentence returned multiple subjects, including direct object, so draw first one
subject[0].draw()
Of course, this assumes that whatever is labeled "Subject" is what you want to draw.
