I'm trying to lemmatize all of the words in a sentence with NLTK's WordNetLemmatizer. I have a bunch of sentences but am just using the first sentence to ensure I'm doing this correctly. Here's what I have:
train_sentences[0]
"Explanation Why edits made username Hardcore Metallica Fan reverted? They vandalisms, closure GAs I voted New York Dolls FAC. And please remove template talk page since I'm retired now.89.205.38.27"
So now I try to lemmatize each word as follows:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
new_sent = [lemmatizer.lemmatize(word) for word in train_sentences[0].split()]
print(new_sent)
And I get back:
['Explanation', 'Why', 'edits', 'made', 'username', 'Hardcore', 'Metallica', 'Fan', 'reverted?', 'They', 'vandalisms,', 'closure', 'GAs', 'I', 'voted', 'New', 'York', 'Dolls', 'FAC.', 'And', 'please', 'remove', 'template', 'talk', 'page', 'since', "I'm", 'retired', 'now.89.205.38.27']
A couple of questions:
1) Why does "edits" not get transformed into "edit"? Admittedly, if I do lemmatizer.lemmatize("edits") I get back edits but was surprised.
2) Why is "vandalisms" not transformed into "vandalism"? This one is very surprising, since if I do lemmatizer.lemmatize("vandalisms"), I get back vandalism...
Any clarification / guidance would be awesome!
TL;DR
First tag the sentence, then use the POS tag as the additional parameter input for the lemmatization.
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def penn2morphy(penntag):
    """Converts Penn Treebank tags to WordNet POS tags."""
    morphy_tag = {'NN': 'n', 'JJ': 'a',
                  'VB': 'v', 'RB': 'r'}
    # Tags without a WordNet equivalent fall back to noun.
    return morphy_tag.get(penntag[:2], 'n')

def lemmatize_sent(text):
    # Text input is a string; returns a list of lowercased lemmas.
    return [wnl.lemmatize(word.lower(), pos=penn2morphy(tag))
            for word, tag in pos_tag(word_tokenize(text))]

lemmatize_sent('He is walking to school')
For a detailed walkthrough of how and why the POS tag is necessary see https://www.kaggle.com/alvations/basic-nlp-with-nltk
Alternatively, you can use pywsd tokenizer + lemmatizer, a wrapper of NLTK's WordNetLemmatizer:
Install:
pip install -U nltk
python -m nltk.downloader popular
pip install -U pywsd
Code:
>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 9.307677984237671 secs.
>>> text = "Mary leaves the room"
>>> lemmatize_sentence(text)
['mary', 'leave', 'the', 'room']
>>> text = 'Dew drops fall from the leaves'
>>> lemmatize_sentence(text)
['dew', 'drop', 'fall', 'from', 'the', 'leaf']
(Note to moderators: I can't mark this question as duplicate of nltk: How to lemmatize taking surrounding words into context? because the answer wasn't accepted there but it is a duplicate).
This is really something that the nltk community would be able to answer.
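This is happening because of the trailing comma in "vandalisms,": str.split() only splits on whitespace, so the lemmatizer is handed "vandalisms," with the comma attached and finds no match. To remove the trailing comma, you could use .strip(',') or split on multiple delimiters as described here.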
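To make the two suggestions concrete, here is a minimal sketch using only the standard library. The sentence is a shortened piece of the one in the question, and the delimiter set in the second option is just an assumption about which characters you want to split on; adjust it to your data.

```python
import re
import string

sentence = "They vandalisms, closure GAs I voted New York Dolls FAC."

# Option 1: split on whitespace, then strip punctuation from both ends of each token.
stripped = [word.strip(string.punctuation) for word in sentence.split()]

# Option 2: split on several delimiters at once (whitespace, comma, period, question mark).
resplit = [word for word in re.split(r"[\s,.?]+", sentence) if word]

print(stripped)
print(resplit)
```

Both lists now contain "vandalisms" without the comma, so lemmatizer.lemmatize() can find it. A proper tokenizer such as word_tokenize, as used in the answer above, avoids the problem in the first place.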
Related
While tokenizing multiple sentences from a large corpus, I need to preserve certain words in their original form, like .Net, C#, C++. I also want to remove punctuation marks (.,!_-()=*&^%$#~ etc.) but need to preserve words like .net, .htaccess, .htpassword, c++ etc.
I have tried both nltk.word_tokenize and nltk.regexp_tokenize, but I am not getting the expected output.
Please help me fix this issue.
The code:
import nltk
from nltk import regexp_tokenize
from nltk.corpus import stopwords

def pre_data():
    tokenized_sentences = nltk.sent_tokenize(tokenized_raw_data)
    sw0 = (stopwords.words('english'))
    sw1 = ["i.e", "dxint", "hrangle", "idoteq", "devs", "zero"]
    sw = sw0 + sw1
    tokens = [[word for word in regexp_tokenize(word, pattern=r"\s|\d|[^.+#\w a-z]", gaps=True)] for word in tokenized_sentences]
    print(tokens)

pre_data()
The tokenized_raw_data is a normal text file. It contains multiple sentences with white spaces in between and consisting of words like .blog, .net, c++, c#, asp.net, .htaccess etc.
Example
['.blog is a generic top-level domain intended for use by blogs.',
 'C# is a general-purpose, multi-paradigm programming language.',
 'C++ is object-oriented programming language.']
This solution covers the given examples and preserves words like C++, C#, asp.net and so on while removing normal punctuation.
import nltk

paragraph = (
    '.blog is a generic top-level domain intended for use by blogs. '
    'C# is a general-purpose, multi-paradigm programming language. '
    'C++ is object-oriented programming language. '
    'asp.net is something very strange. '
    'The most fascinating language is c#. '
    '.htaccess makes my day!'
)

def pre_data(raw_data):
    tokenized_sentences = nltk.sent_tokenize(raw_data)
    # Optional leading word characters, an optional dot, word characters,
    # then any trailing '#' or '+' (keeps .blog, asp.net, C#, C++).
    tokens = [nltk.regexp_tokenize(sentence, pattern=r'\w*\.?\w+[#+]*')
              for sentence in tokenized_sentences]
    return tokens

tokenized_data = pre_data(paragraph)
print(tokenized_data)
Out
[['.blog', 'is', 'a', 'generic', 'top', 'level', 'domain', 'intended', 'for', 'use', 'by', 'blogs'],
['C#', 'is', 'a', 'general', 'purpose', 'multi', 'paradigm', 'programming', 'language'],
['C++', 'is', 'object', 'oriented', 'programming', 'language'],
['asp.net', 'is', 'something', 'very', 'strange'],
['The', 'most', 'fascinating', 'language', 'is', 'c#'],
['.htaccess', 'makes', 'my', 'day']]
However, this simple regular expression will probably not work for all technical terms in your texts. Provide full examples for a more general solution.
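To see where the pattern stops working, you can try it on a few strings with the standard re module. This is only a quick check, assuming re.findall matches the same way nltk.regexp_tokenize does for this pattern; the helper name tokens and the test strings are made up for illustration.

```python
import re

PATTERN = r'\w*\.?\w+[#+]*'  # same pattern as in pre_data() above

def tokens(text):
    """Return all non-overlapping matches of PATTERN in text."""
    return re.findall(PATTERN, text)

# Terms with at most one dot, or with trailing '#' / '+', survive intact.
print(tokens('C++ and C# beat .NET'))

# A name with more than one dot is cut after the second part.
print(tokens('visit www.example.com'))
```

The second call returns 'www.example' and '.com' as separate tokens, because the pattern allows only one dot per token. Terms like that need a broader expression or a list of protected words.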
I'm new to Python. I have a big data set from Twitter and I want to tokenize it,
but I don't know how to keep phrasal verbs such as "look for", "take off", "grow up", etc. together as single tokens, and this is important to me.
My code is:
>>> from nltk.tokenize import word_tokenize
>>> s = "I'm looking for the answer"
>>> word_tokenize(s)
['I', "'m", 'looking', 'for', 'the', 'answer']
My data set is big, so I can't use the code from this page:
Find multi-word terms in a tokenized text in Python
So, how can I solve my problem?
You need to use parts of speech tags for that, or actually dependency parsing would be more accurate. I haven't tried with nltk, but with spaCy you can do it like this:
import spacy

nlp = spacy.load('en_core_web_lg')

def chunk_phrasal_verbs(lemmatized_sentence):
    ph_verbs = []
    for word in nlp(lemmatized_sentence):
        # A preposition whose syntactic head is a verb, e.g. "look" + "for".
        if word.dep_ == 'prep' and word.head.pos_ == 'VERB':
            ph_verb = word.head.text + ' ' + word.text
            ph_verbs.append(ph_verb)
    return ph_verbs
I also suggest lemmatizing the sentence first to get rid of inflected forms. If you need noun phrases as well, you can use the compound dependency relation in a similar way.
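If running a full parser over a large Twitter data set is too slow, a cheaper option is a plain lookup against a list of known phrasal verbs. Note that this is a different technique from the dependency-based approach above, not a replacement for it. The sketch assumes the tokens are already lowercased and lemmatized, and PHRASAL_VERBS is a hypothetical hand-made lexicon that you would have to supply yourself.

```python
# Hypothetical mini-lexicon of phrasal verbs, stored as (verb lemma, particle) pairs.
PHRASAL_VERBS = {("look", "for"), ("take", "off"), ("grow", "up")}

def merge_phrasal_verbs(lemmas, lexicon=PHRASAL_VERBS):
    """Join adjacent (verb, particle) pairs found in the lexicon into one token."""
    merged = []
    i = 0
    while i < len(lemmas):
        pair = tuple(lemmas[i:i + 2])
        if len(pair) == 2 and pair in lexicon:
            merged.append(" ".join(pair))
            i += 2  # skip the particle, it is now part of the merged token
        else:
            merged.append(lemmas[i])
            i += 1
    return merged

print(merge_phrasal_verbs(["i", "be", "look", "for", "the", "answer"]))
```

This only catches a verb and particle that sit next to each other, so "take your shoes off" would be missed; that is the case where the dependency parse is worth its cost.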
from stemming.porter2 import stem
documents = ['got',"get"]
documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]
print(documents)
The result is:
[['got'], ['get']]
Can someone help explain this?
Thank you!
What you want is a lemmatizer instead of a stemmer. The difference is subtle.
Generally, a stemmer strips as many suffixes as it can and, in some cases, consults an exception list for words whose normalized form cannot be found by simply dropping suffixes.
A lemmatizer tries to find the base (dictionary) form of a word, e.g. the infinitive of a verb, and usually requires specialized rules for each language.
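A toy illustration of the difference, in plain Python. Neither function is how the Porter stemmer or the WordNet lemmatizer actually works; SUFFIXES and IRREGULAR are made-up, deliberately tiny tables, just to show why suffix stripping alone can never turn "got" into "get".

```python
SUFFIXES = ("ing", "ed", "s")              # toy suffix list, far from complete
IRREGULAR = {"got": "get", "ate": "eat"}   # toy dictionary of irregular forms

def toy_stem(word):
    """Strip the first matching suffix; knows nothing about irregular forms."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look up irregular forms first, then fall back to suffix stripping."""
    return IRREGULAR.get(word, toy_stem(word))

for w in ["walked", "got", "ate"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

"walked" becomes "walk" either way, but "got" and "ate" have no suffix to remove, so only the dictionary lookup recovers "get" and "eat".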
See
what is the true difference between lemmatization vs stemming?
Stemmers vs Lemmatizers
Lemmatization using the NLTK implementation of the morphy lemmatizer requires the correct part-of-speech (POS) tag to be fairly accurate.
Avoid lemmatizing individual words in isolation (in fact, never do it). Lemmatize a fully POS-tagged sentence instead, e.g.
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn

def penn2morphy(penntag, returnNone=False, default_to_noun=False):
    """Map a Penn Treebank tag to a WordNet (morphy) POS tag."""
    morphy_tag = {'NN': wn.NOUN, 'JJ': wn.ADJ,
                  'VB': wn.VERB, 'RB': wn.ADV}
    try:
        return morphy_tag[penntag[:2]]
    except KeyError:
        # No WordNet equivalent for this tag (e.g. pronouns, prepositions).
        if returnNone:
            return None
        elif default_to_noun:
            return 'n'
        else:
            return ''
With the penn2morphy helper function, you can convert the POS tags from pos_tag() to morphy tags and then lemmatize:
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> sent = "He got up in bed at 8am."
>>> [(token, penn2morphy(tag)) for token, tag in pos_tag(word_tokenize(sent))]
[('He', ''), ('got', 'v'), ('up', ''), ('in', ''), ('bed', 'n'), ('at', ''), ('8am', ''), ('.', '')]
>>> [wnl.lemmatize(token, pos=penn2morphy(tag, default_to_noun=True)) for token, tag in pos_tag(word_tokenize(sent))]
['He', 'get', 'up', 'in', 'bed', 'at', '8am', '.']
For convenience you can also try the pywsd lemmatizer.
>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 7.196984529495239 secs.
>>> sent = "He got up in bed at 8am."
>>> lemmatize_sentence(sent)
['he', 'get', 'up', 'in', 'bed', 'at', '8am', '.']
See also https://stackoverflow.com/a/22343640/610569
The following code prints out leaf:
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()
print(lem.lemmatize('leaves'))
This may or may not be accurate depending on the surrounding context, e.g. Mary leaves the room vs. Dew drops fall from the leaves. How can I tell NLTK to lemmatize words taking surrounding context into account?
TL;DR
First tag the sentence, then use the POS tag as the additional parameter input for the lemmatization.
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def penn2morphy(penntag):
    """Converts Penn Treebank tags to WordNet POS tags."""
    morphy_tag = {'NN': 'n', 'JJ': 'a',
                  'VB': 'v', 'RB': 'r'}
    # Tags without a WordNet equivalent fall back to noun.
    return morphy_tag.get(penntag[:2], 'n')

def lemmatize_sent(text):
    # Text input is a string; returns a list of lowercased lemmas.
    return [wnl.lemmatize(word.lower(), pos=penn2morphy(tag))
            for word, tag in pos_tag(word_tokenize(text))]

lemmatize_sent('He is walking to school')
For a detailed walkthrough of how and why the POS tag is necessary see https://www.kaggle.com/alvations/basic-nlp-with-nltk
Alternatively, you can use pywsd tokenizer + lemmatizer, a wrapper of NLTK's WordNetLemmatizer:
Install:
pip install -U nltk
python -m nltk.downloader popular
pip install -U pywsd
Code:
>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 9.307677984237671 secs.
>>> text = "Mary leaves the room"
>>> lemmatize_sentence(text)
['mary', 'leave', 'the', 'room']
>>> text = 'Dew drops fall from the leaves'
>>> lemmatize_sentence(text)
['dew', 'drop', 'fall', 'from', 'the', 'leaf']
I am trying to get the base English word for a word that has been modified from its base form. This question has been asked here before, but I didn't see a proper answer, so I am trying to put it this way. I tried two stemmers and one lemmatizer from the NLTK package: the Porter stemmer, the Snowball stemmer, and the WordNet lemmatizer.
I tried this code:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

words = ['arrival', 'conclusion', 'ate']
for word in words:
    print("\n\nOriginal Word =>", word)
    print("porter stemmer=>", PorterStemmer().stem(word))
    snowball_stemmer = SnowballStemmer("english")
    print("snowball stemmer=>", snowball_stemmer.stem(word))
    print("WordNet Lemmatizer=>", WordNetLemmatizer().lemmatize(word))
This is the output I get:
Original Word => arrival
porter stemmer=> arriv
snowball stemmer=> arriv
WordNet Lemmatizer=> arrival
Original Word => conclusion
porter stemmer=> conclus
snowball stemmer=> conclus
WordNet Lemmatizer=> conclusion
Original Word => ate
porter stemmer=> ate
snowball stemmer=> ate
WordNet Lemmatizer=> ate
but I want this output
Input : arrival
Output: arrive
Input : conclusion
Output: conclude
Input : ate
Output: eat
How can I achieve this? Are there any tools already available for this? I am aware that this is called morphological analysis, but there must be some tools which already do it. Help is appreciated :)
First Edit
I tried this code
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn

query = "The Indian economy is the worlds tenth largest by nominal GDP and third largest by purchasing power parity"

def is_noun(tag):
    return tag in ['NN', 'NNS', 'NNP', 'NNPS']

def is_verb(tag):
    return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

def is_adverb(tag):
    return tag in ['RB', 'RBR', 'RBS']

def is_adjective(tag):
    return tag in ['JJ', 'JJR', 'JJS']

def penn_to_wn(tag):
    if is_adjective(tag):
        return wn.ADJ
    elif is_noun(tag):
        return wn.NOUN
    elif is_adverb(tag):
        return wn.ADV
    elif is_verb(tag):
        return wn.VERB
    # Anything else is treated as a noun.
    return wn.NOUN

tags = nltk.pos_tag(word_tokenize(query))
for tag in tags:
    wn_tag = penn_to_wn(tag[1])
    print(tag[0] + "---> " + WordNetLemmatizer().lemmatize(tag[0], wn_tag))
Here, I tried to use wordnet lemmatizer by providing proper tags. Here is the output:
The---> The
Indian---> Indian
economy---> economy
is---> be
the---> the
worlds---> world
tenth---> tenth
largest---> large
by---> by
nominal---> nominal
GDP---> GDP
and---> and
third---> third
largest---> large
by---> by
purchasing---> purchase
power---> power
parity---> parity
Still, words like "arrival" and "conclusion" won't get processed with this approach. Is there any solution for this?
Try the word_forms package: clone it from here and do pip install -e word_forms.
from word_forms.word_forms import get_word_forms
get_word_forms('conclusion')
# gives:
{'a': {'conclusive'},
'n': {'conclusion', 'conclusions', 'conclusivenesses', 'conclusiveness'},
'r': {'conclusively'},
'v': {'concludes', 'concluded', 'concluding', 'conclude'}}
In your case, you'd like to get a verb form from a noun word form.
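One way to pick that verb out of the result is sketched below. The dictionary is copied from the output above so the snippet runs without the package installed; shortest_form is a hypothetical helper, and "the shortest form is the base form" is only a heuristic that happens to work for this example, not something the package promises.

```python
# Output of get_word_forms('conclusion'), copied from the example above.
forms = {
    'a': {'conclusive'},
    'n': {'conclusion', 'conclusions', 'conclusivenesses', 'conclusiveness'},
    'r': {'conclusively'},
    'v': {'concludes', 'concluded', 'concluding', 'conclude'},
}

def shortest_form(word_forms, pos):
    """Heuristic: return the shortest form for a POS, or None if there is none."""
    candidates = word_forms.get(pos, set())
    # Sort key is (length, word) so that ties are resolved deterministically.
    return min(candidates, key=lambda w: (len(w), w), default=None)

print(shortest_form(forms, 'v'))
```

For 'v' this prints conclude. Irregular verbs can break the heuristic, so check the results on your own vocabulary before relying on it.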
Ok, so... for the word "ate" I think you're looking for NodeBox::Linguistics.
print en.verb.present("gave")
>>> give
Also, I did not completely understand why you want the verb for "arrival" but not the one for "conclusion".