I'm using the NLTK WordNet Lemmatizer for a Part-of-Speech tagging project by first modifying each word in the training corpus to its stem (in place modification), and then training only on the new corpus. However, I found that the lemmatizer is not functioning as I expected it to.
For example, the word loves is lemmatized to love which is correct, but the word loving remains loving even after lemmatization. Here loving is as in the sentence "I'm loving it".
Isn't love the stem of the inflected word loving? Similarly, many other 'ing' forms remain as they are after lemmatization. Is this the correct behavior?
What are some other lemmatizers that are accurate? (need not be in NLTK) Are there morphology analyzers or lemmatizers that also take into account a word's Part Of Speech tag, in deciding the word stem? For example, the word killing should have kill as the stem if killing is used as a verb, but it should have killing as the stem if it is used as a noun (as in the killing was done by xyz).
The WordNet lemmatizer does take the POS tag into account, but it doesn't magically determine it:
>>> import nltk
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving')
'loving'
>>> nltk.stem.WordNetLemmatizer().lemmatize('loving', 'v')
'love'
Without a POS tag, it assumes everything you feed it is a noun. So here it thinks you're passing it the noun "loving" (as in "sweet loving").
The best way to troubleshoot this is to actually look in WordNet. Take a look here: loving in WordNet. As you can see, there is actually an adjective "loving" present in WordNet. In fact, there is even the adverb "lovingly": lovingly in WordNet. Because WordNet doesn't know which part of speech you actually want, it defaults to noun ('n' in WordNet). If you are using the Penn Treebank tag set, here's a handy set of functions for transforming Penn tags to WN tags:
from nltk.corpus import wordnet as wn

def is_noun(tag):
    return tag in ['NN', 'NNS', 'NNP', 'NNPS']

def is_verb(tag):
    return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

def is_adverb(tag):
    return tag in ['RB', 'RBR', 'RBS']

def is_adjective(tag):
    return tag in ['JJ', 'JJR', 'JJS']

def penn_to_wn(tag):
    if is_adjective(tag):
        return wn.ADJ
    elif is_noun(tag):
        return wn.NOUN
    elif is_adverb(tag):
        return wn.ADV
    elif is_verb(tag):
        return wn.VERB
    return None
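For instance, here's a quick sketch of wiring penn_to_wn into the lemmatizer for the "killing" example from the question (assuming the usual NLTK tagger/tokenizer data is downloaded; the exact tags depend on the tagger model):
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
for token, tag in pos_tag(word_tokenize("The killing was done by xyz")):
    wn_tag = penn_to_wn(tag)
    # fall back to the raw token when penn_to_wn returns None
    lemma = wnl.lemmatize(token, wn_tag) if wn_tag else token
    print(token, '->', lemma)
# 'killing' tagged as a noun (NN, likely here) should stay 'killing';
# tagged as a verb (e.g. in "he was killing time") it would become 'kill'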
Hope this helps.
Using the tag prefixes is clearer and more effective than enumerating every tag:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

def penn_to_wn(tag):
    return get_wordnet_pos(tag)
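A quick usage sketch (the exact tags depend on NLTK's default tagger, so the outputs here are only illustrative):
from nltk import word_tokenize, pos_tag

for token, tag in pos_tag(word_tokenize("She is running quickly")):
    # e.g. VBG -> 'v', RB -> 'r'; PRP falls through to ''
    print(token, tag, get_wordnet_pos(tag))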
As an extension to the accepted answer from @Fred Foo above:
from nltk import WordNetLemmatizer, pos_tag, word_tokenize
from nltk.corpus import wordnet

lem = WordNetLemmatizer()
word = input("Enter word:\t")

# Get the single-character POS constant from pos_tag like this:
pos_label = (pos_tag(word_tokenize(word))[0][1][0]).lower()

# pos_refs = {'n': ['NN', 'NNS', 'NNP', 'NNPS'],
#             'v': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
#             'r': ['RB', 'RBR', 'RBS'],
#             'a': ['JJ', 'JJR', 'JJS']}

if pos_label == 'j':
    pos_label = 'a'    # 'j' <--> 'a' reassignment

if pos_label in ['r']:  # for adverbs it's a bit different
    print(wordnet.synset(word + '.r.1').lemmas()[0].pertainyms()[0].name())
elif pos_label in ['a', 's', 'v']:  # for adjectives and verbs
    print(lem.lemmatize(word, pos=pos_label))
else:  # for nouns and everything else, as it is the default kwarg
    print(lem.lemmatize(word))
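Why the synset(...).pertainyms() detour for adverbs? The WordNet lemmatizer leaves most '-ly' adverbs untouched, but WordNet links an adverb sense back to the adjective it derives from. A minimal sketch (taking the first adverb sense, .r.1, is a simplifying assumption; other senses may exist):
from nltk.corpus import wordnet

# 'angrily' -> first adverb sense -> its pertainym adjective
print(wordnet.synset('angrily.r.1').lemmas()[0].pertainyms()[0].name())
# angry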
Related
My goal is to label a word as either a noun, verb, adjective, or adverb (part-of-speech-tagging) using nltk and its wordnet function. I followed the code example of this stackoverflow thread to calculate the tag:
# imports
import nltk
from nltk.corpus import wordnet

# get part-of-speech tag
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].lower()
    tag_dict = {"a": wordnet.ADJ,
                "n": wordnet.NOUN,
                "v": wordnet.VERB,
                "r": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)
However, the resulting function only works for nouns, verbs, and adverbs, but it does not work for adjectives, as it always shows 'noun' ('n') instead of 'adjective' ('a') when I enter a word that is clearly an adjective. See the following output as an example:
get_wordnet_pos("biggest")
Out[86]: 'n'
get_wordnet_pos("small")
Out[87]: 'n'
get_wordnet_pos("tiny")
Out[88]: 'n'
get_wordnet_pos("great")
Out[89]: 'n'
get_wordnet_pos("house")
Out[90]: 'n'
get_wordnet_pos("cutting")
Out[92]: 'v'
get_wordnet_pos("largly")
Out[93]: 'r'
'Biggest', 'small', 'tiny', and 'great' are all labeled as a noun ('n'), even though they are clearly not. For all the other examples, the function works as intended (i.e., it labels nouns as 'noun', verbs as 'verb', and adverbs as 'adverb'). Why does tagging adjectives not work, while for all the other word categories the code does perform correctly? Could someone please help me, so that adjectives are labeled correctly instead of as nouns?
I tried looking up whether "wordnet.ADJ" is wrong (here), but it seems to be correct.
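For reference, the tag-mapping helpers earlier in this thread point at the cause: Penn Treebank adjective tags are JJ/JJR/JJS, so tag (the first letter, lowercased) is 'j', which is not a key in tag_dict and therefore falls through to the noun default. A minimal sketch of the fix, keying the dict on 'j' (this assumes the tagger yields a JJ* tag for these words):
import nltk
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].lower()
    tag_dict = {"j": wordnet.ADJ,   # Penn adjective tags start with 'J', not 'A'
                "n": wordnet.NOUN,
                "v": wordnet.VERB,
                "r": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

print(get_wordnet_pos("biggest"))  # 'a', assuming the tagger tags it JJS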
I'm currently working on lemmatizing a sentence while also applying pos_tags. This is what I have so far:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

lem = WordNetLemmatizer()

def findTag(sentence):
    sentence = word_tokenize(sentence)
    sentence = [i.strip(" ") for i in sentence]
    pos_label = nltk.pos_tag(sentence)[0][1][0].lower()
    if pos_label == "j":
        pos_label == "a"
    if pos_label in ["a", "n", "v"]:
        print(lem.lemmatize(word, pos = pos_label))
    elif pos_label in ['r']:
        print(wordnet.synset(sentence+".r.1").lemmas()[0].pertainyms()[0].name())
    else:
        print(lem.lemmatize(sentence))

findTag("I love running angrily")
However, when I input a sentence with this I get the error
Traceback (most recent call last):
File "spoilerDetect.py", line 25, in <module>
findTag("I love running angrily")
File "spoilerDetect.py", line 22, in findTag
print(lem.lemmatize(sentence))
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/nltk/stem/wordnet.py", line 41, in lemmatize
lemmas = wordnet._morphy(word, pos)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/nltk/corpus/reader/wordnet.py", line 1905, in _morphy
if form in exceptions:
TypeError: unhashable type: 'list'
I understand that lists are unhashable but am unsure of how to fix this. Do I change lists to a tuple or is there something I'm not understanding?
Let's walk through the code and see how to get your desired output.
First the imports, you have
import nltk
from nltk import pos_tag
and then you were using
pos_label = nltk.pos_tag(...)
Since you're already doing from nltk import pos_tag, pos_tag is already in the global namespace, so just do:
pos_label = pos_tag(...)
Idiomatically, the imports should be cleaned up a little to look like this:
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
Next, keeping the list of tokenized words, then the list of POS tags, and then the list of lemmas as separate variables sounds logical, but since the function ultimately returns only the final result, you can chain up pos_tag(word_tokenize(...)) and iterate through it to retrieve each token and its POS tag, i.e.
sentence = "I love running angrily"
for word, pos in pos_tag(word_tokenize(sentence)):
    print(word, '|', pos)
[out]:
I | PRP
love | VBP
running | VBG
angrily | RB
Now, we know that there's a mismatch between the outputs of pos_tag and the POS that the WordNetLemmatizer is expecting. From https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L124, there is a function called penn2morphy that looks like this:
def penn2morphy(penntag, returnNone=False, default_to_noun=False) -> str:
    """
    Converts tags from Penn format (input: single string) to Morphy.
    """
    morphy_tag = {'NN': 'n', 'JJ': 'a', 'VB': 'v', 'RB': 'r'}
    try:
        return morphy_tag[penntag[:2]]
    except:
        if returnNone:
            return None
        elif default_to_noun:
            return 'n'
        else:
            return ''
An example:
>>> penn2morphy('JJ')
'a'
>>> penn2morphy('PRP')
''
And if we use these converted tags as inputs to the WordNetLemmatizer and reuse your if-else conditions:
sentence = "I love running angrily"
for token, pos in pos_tag(word_tokenize(sentence)):
    morphy_pos = penn2morphy(pos)
    if morphy_pos in ["a", "n", "v"]:
        print(wnl.lemmatize(token, pos=morphy_pos))
    elif morphy_pos in ['r']:
        print(wn.synset(token+".r.1").lemmas()[0].pertainyms()[0].name())
    else:
        print(wnl.lemmatize(token))
[out]:
I
love
run
angry
Hey, what did you do there? Your code works but mine doesn't!
Okay, now that we know how to get the desired output, let's recap.
First, we cleaned up the imports.
Then, we cleaned up the preprocessing (without keeping intermediate variables).
Then, we "functionalized" the conversion of POS tags from Penn -> Morphy.
Lastly, we applied the same if/else conditions and ran the lemmatizer.
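Putting the recap together, a minimal sketch that bundles the steps above into one function (it assumes the penn2morphy helper and the wnl / wn / pos_tag / word_tokenize names defined earlier in this answer):
def lemmatize_with_pos(sentence):
    lemmas = []
    for token, pos in pos_tag(word_tokenize(sentence)):
        morphy_pos = penn2morphy(pos)
        if morphy_pos in ('a', 'n', 'v'):
            lemmas.append(wnl.lemmatize(token, pos=morphy_pos))
        elif morphy_pos == 'r':
            # adverb -> base adjective via the first sense's pertainym
            lemmas.append(wn.synset(token + '.r.1').lemmas()[0].pertainyms()[0].name())
        else:
            lemmas.append(wnl.lemmatize(token))
    return lemmas

print(lemmatize_with_pos("I love running angrily"))  # ['I', 'love', 'run', 'angry']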
But how is it that my code doesn't work?!
Okay, let's work through your code to see why you're getting the error.
First, let's check every output that you get within the findTag function, printing the type of each output and the output itself:
sentence = "I love running angrily"
sentence = word_tokenize(sentence)
print(type(sentence))
print(sentence)
[out]:
<class 'list'>
['I', 'love', 'running', 'angrily']
At sentence = word_tokenize(sentence), you have already overwritten your original variable with the output of the function; usually that's a sign of errors later on =)
Now let's look at the next line:
sentence = "I love running angrily"
sentence = word_tokenize(sentence)
sentence = [i.strip(" ") for i in sentence]
print(type(sentence))
print(sentence)
[out]:
<class 'list'>
['I', 'love', 'running', 'angrily']
Now we see that the sentence = [i.strip(" ") for i in sentence] is actually meaningless given the example sentence.
Q: But is it true that all tokens output by word_tokenize have no trailing/leading spaces, which is what i.strip(' ') is trying to remove?
A: Yes, it seems so. NLTK first performs regex operations on the string, then calls the str.split() function, which removes any leading/trailing spaces around tokens; see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/destructive.py#L141
Let's continue:
sentence = "I love running angrily"
sentence = word_tokenize(sentence)
sentence = [i.strip(" ") for i in sentence]
pos_label = nltk.pos_tag(sentence)[0][1][0].lower()
print(type(pos_label))
print(pos_label)
[out]:
<class 'str'>
p
Q: Wait a minute, why is pos_label only a single character?
Q: And what kind of POS tag is p?
A: Let's look closer at what's happening in nltk.pos_tag(sentence)[0][1][0].lower()
Usually, when you have to do such nested [0][1][0] index retrieval, it's error-prone. We need to ask: what is [0][1][0] accessing?
We know that sentence, after sentence = word_tokenize(sentence), has become a list of strings, and pos_tag(sentence) returns a list of tuples of strings, where the first item in each tuple is the token and the second is the POS tag, i.e.
sentence = "I love running angrily"
sentence = word_tokenize(sentence)
sentence = [i.strip(" ") for i in sentence]
thing = pos_tag(sentence)
print(type(thing))
print(thing)
[out]:
<class 'list'>
[('I', 'PRP'), ('love', 'VBP'), ('running', 'VBG'), ('angrily', 'RB')]
Now that we know thing = pos_tag(word_tokenize("I love running angrily")) outputs the above, let's work with that to see what [0][1][0] is accessing.
>>> thing = [('I', 'PRP'), ('love', 'VBP'), ('running', 'VBG'), ('angrily', 'RB')]
>>> thing[0]
('I', 'PRP')
So thing[0] outputs the tuple of (token, pos) for the first token.
>>> thing = [('I', 'PRP'), ('love', 'VBP'), ('running', 'VBG'), ('angrily', 'RB')]
>>> thing[0][1]
'PRP'
And thing[0][1] outputs the POS for the first token.
>>> thing = [('I', 'PRP'), ('love', 'VBP'), ('running', 'VBG'), ('angrily', 'RB')]
>>> thing[0][1][0]
'P'
Ah, so [0][1][0] looks up the first character of the POS tag of the first token.
So the question is: is that the desired behavior? If so, why are you looking only at the POS of the first word?
Regardless of what I'm looking at, your explanation still doesn't tell me why the TypeError: unhashable type: 'list' occurs. Stop distracting me and tell me how to resolve the TypeError!!
Okay, we move on. Now we know that thing = pos_tag(word_tokenize("I love running angrily")) and that thing[0][1][0].lower() == 'p'.
Given your if-else conditions,
if pos_label in ["a", "n", "v"]:
    print(lem.lemmatize(word, pos = pos_label))
elif pos_label in ['r']:
    print(wordnet.synset(sentence+".r.1").lemmas()[0].pertainyms()[0].name())
else:
    print(lem.lemmatize(sentence))
we find that the 'p' value would have gone to the else branch, i.e. print(lem.lemmatize(sentence)). But wait a minute, remember what sentence has become after you modified it with:
>>> sentence = word_tokenize("I love running angrily")
>>> sentence = [i.strip(" ") for i in sentence]
>>> sentence
['I', 'love', 'running', 'angrily']
So what happens if we just ignore all the rest of the code and focus on this:
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
sentence = ['I', 'love', 'running', 'angrily']
lem.lemmatize(sentence)
[out]:
-------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-34-497ae98ecaa3> in <module>
4 sentence = ['I', 'love', 'running', 'angrily']
5
----> 6 lem.lemmatize(sentence)
~/Library/Python/3.6/lib/python/site-packages/nltk/stem/wordnet.py in lemmatize(self, word, pos)
39
40 def lemmatize(self, word, pos=NOUN):
---> 41 lemmas = wordnet._morphy(word, pos)
42 return min(lemmas, key=len) if lemmas else word
43
~/Library/Python/3.6/lib/python/site-packages/nltk/corpus/reader/wordnet.py in _morphy(self, form, pos, check_exceptions)
1903 # 0. Check the exception lists
1904 if check_exceptions:
-> 1905 if form in exceptions:
1906 return filter_forms([form] + exceptions[form])
1907
TypeError: unhashable type: 'list'
Ah ha!! That's where the error occurs!!!
It's because WordNetLemmatizer expects a single string as input, and you're passing it a list of strings. Example usage:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
token = 'words'
wnl.lemmatize(token, pos='n')
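And the corresponding fix for the error above: lemmatize token by token rather than passing the whole list (a minimal sketch; without per-token POS tags the noun default still applies, as discussed):
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
sentence = ['I', 'love', 'running', 'angrily']
print([wnl.lemmatize(token) for token in sentence])
# ['I', 'love', 'running', 'angrily'] -- unchanged, since everything
# defaults to noun; pass pos=... per token to get real lemmas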
Q: Why didn't you just get to the point?!
A: Then you would miss out on how to debug your code and make it better =)
from stemming.porter2 import stem
documents = ['got',"get"]
documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]
print(documents)
The result is:
[['got'], ['get']]
Can someone help explain this?
Thank you!
What you want is a lemmatizer instead of a stemmer. The difference is subtle.
Generally, a stemmer drops suffixes as aggressively as possible and, in some cases, keeps an exception list for words that cannot reach a normalized form by simply dropping suffixes.
A lemmatizer tries to find the "basic"/root/infinitive form of a word, and it usually requires specialized rules for different languages.
See
what is the true difference between lemmatization vs stemming?
Stemmers vs Lemmatizers
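To make the contrast concrete, a quick side-by-side sketch (the 'conclus' stem matches the Porter output shown later in this thread):
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

ps = PorterStemmer()
wnl = WordNetLemmatizer()
for w in ['conclusion', 'running']:
    print(w, '| stem:', ps.stem(w), '| lemma (as verb):', wnl.lemmatize(w, pos='v'))
# conclusion | stem: conclus | lemma (as verb): conclusion
# running | stem: run | lemma (as verb): run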
Lemmatization using the NLTK implementation of the morphy lemmatizer requires the correct part-of-speech (POS) tag to be fairly accurate.
Avoid (or in fact, never try) lemmatizing an individual word in isolation. Try lemmatizing a fully POS-tagged sentence instead, e.g.:
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn

def penn2morphy(penntag, returnNone=False, default_to_noun=False):
    morphy_tag = {'NN': wn.NOUN, 'JJ': wn.ADJ,
                  'VB': wn.VERB, 'RB': wn.ADV}
    try:
        return morphy_tag[penntag[:2]]
    except:
        if returnNone:
            return None
        elif default_to_noun:
            return 'n'
        else:
            return ''
With the penn2morphy helper function, you need to convert the POS tags from pos_tag() to Morphy tags, and then you can:
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> sent = "He got up in bed at 8am."
>>> [(token, penn2morphy(tag)) for token, tag in pos_tag(word_tokenize(sent))]
[('He', ''), ('got', 'v'), ('up', ''), ('in', ''), ('bed', 'n'), ('at', ''), ('8am', ''), ('.', '')]
>>> [wnl.lemmatize(token, pos=penn2morphy(tag, default_to_noun=True)) for token, tag in pos_tag(word_tokenize(sent))]
['He', 'get', 'up', 'in', 'bed', 'at', '8am', '.']
For convenience you can also try the pywsd lemmatizer.
>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 7.196984529495239 secs.
>>> sent = "He got up in bed at 8am."
>>> lemmatize_sentence(sent)
['he', 'get', 'up', 'in', 'bed', 'at', '8am', '.']
See also https://stackoverflow.com/a/22343640/610569
I'm trying to lemmatize all of the words in a sentence with NLTK's WordNetLemmatizer. I have a bunch of sentences but am just using the first sentence to ensure I'm doing this correctly. Here's what I have:
train_sentences[0]
"Explanation Why edits made username Hardcore Metallica Fan reverted? They vandalisms, closure GAs I voted New York Dolls FAC. And please remove template talk page since I'm retired now.89.205.38.27"
So now I try to lemmatize each word as follows:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
new_sent = [lemmatizer.lemmatize(word) for word in train_sentences[0].split()]
print(new_sent)
And I get back:
['Explanation', 'Why', 'edits', 'made', 'username', 'Hardcore', 'Metallica', 'Fan', 'reverted?', 'They', 'vandalisms,', 'closure', 'GAs', 'I', 'voted', 'New', 'York', 'Dolls', 'FAC.', 'And', 'please', 'remove', 'template', 'talk', 'page', 'since', "I'm", 'retired', 'now.89.205.38.27']
A couple questions:
1) Why does "edits" not get transformed into "edit"? Admittedly, if I do lemmatizer.lemmatize("edits") I get back edits but was surprised.
2) Why is "vandalisms" not transformed into "vandalism"? This one is very surprising, since if I do lemmatizer.lemmatize("vandalisms"), I get back vandalism...
Any clarification / guidance would be awesome!
TL;DR
First tag the sentence, then use the POS tag as the additional parameter input for the lemmatization.
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def penn2morphy(penntag):
    """ Converts Penn Treebank tags to WordNet. """
    morphy_tag = {'NN': 'n', 'JJ': 'a',
                  'VB': 'v', 'RB': 'r'}
    try:
        return morphy_tag[penntag[:2]]
    except:
        return 'n'

def lemmatize_sent(text):
    # Text input is string, returns list of lowercased lemma strings.
    return [wnl.lemmatize(word.lower(), pos=penn2morphy(tag))
            for word, tag in pos_tag(word_tokenize(text))]
lemmatize_sent('He is walking to school')
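With NLTK's default tagger this should return something like ['he', 'be', 'walk', 'to', 'school'] (note the lowercasing, and 'is' -> 'be', 'walking' -> 'walk'); the exact result depends on the tagger model.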
For a detailed walkthrough of how and why the POS tag is necessary see https://www.kaggle.com/alvations/basic-nlp-with-nltk
Alternatively, you can use the pywsd tokenizer + lemmatizer, a wrapper around NLTK's WordNetLemmatizer:
Install:
pip install -U nltk
python -m nltk.downloader popular
pip install -U pywsd
Code:
>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 9.307677984237671 secs.
>>> text = "Mary leaves the room"
>>> lemmatize_sentence(text)
['mary', 'leave', 'the', 'room']
>>> text = 'Dew drops fall from the leaves'
>>> lemmatize_sentence(text)
['dew', 'drop', 'fall', 'from', 'the', 'leaf']
(Note to moderators: I can't mark this question as a duplicate of nltk: How to lemmatize taking surrounding words into context? because the answer wasn't accepted there, but it is a duplicate.)
This is really something that the nltk community would be able to answer.
This is happening because of the , at the end of vandalisms,. To remove this trailing ,, you could use .strip(',') or use multiple delimiters as described here.
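A sketch of that fix: tokenize with word_tokenize instead of str.split(), so the comma becomes its own token and no longer blocks the lemma lookup:
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print([wnl.lemmatize(w) for w in word_tokenize("They vandalisms, closure")])
# ['They', 'vandalism', ',', 'closure']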
I am trying to get the basic English word for an English word that has been modified from its base form. This question has been asked here, but I didn't see a proper answer, so I am trying to put it this way. I tried two stemmers and one lemmatizer from the NLTK package: the Porter stemmer, the Snowball stemmer, and the WordNet lemmatizer.
I tried this code:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

words = ['arrival', 'conclusion', 'ate']
for word in words:
    print("\n\nOriginal Word =>", word)
    print("porter stemmer=>", PorterStemmer().stem(word))
    snowball_stemmer = SnowballStemmer("english")
    print("snowball stemmer=>", snowball_stemmer.stem(word))
    print("WordNet Lemmatizer=>", WordNetLemmatizer().lemmatize(word))
This is the output I get:
Original Word => arrival
porter stemmer=> arriv
snowball stemmer=> arriv
WordNet Lemmatizer=> arrival
Original Word => conclusion
porter stemmer=> conclus
snowball stemmer=> conclus
WordNet Lemmatizer=> conclusion
Original Word => ate
porter stemmer=> ate
snowball stemmer=> ate
WordNet Lemmatizer=> ate
But I want this output:
Input : arrival
Output: arrive
Input : conclusion
Output: conclude
Input : ate
Output: eat
How can I achieve this? Are there any tools already available for this? This is called morphological analysis; I am aware of that, but there must be some tools that already achieve this. Help is appreciated :)
First Edit
I tried this code:
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn

query = "The Indian economy is the worlds tenth largest by nominal GDP and third largest by purchasing power parity"

def is_noun(tag):
    return tag in ['NN', 'NNS', 'NNP', 'NNPS']

def is_verb(tag):
    return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

def is_adverb(tag):
    return tag in ['RB', 'RBR', 'RBS']

def is_adjective(tag):
    return tag in ['JJ', 'JJR', 'JJS']

def penn_to_wn(tag):
    if is_adjective(tag):
        return wn.ADJ
    elif is_noun(tag):
        return wn.NOUN
    elif is_adverb(tag):
        return wn.ADV
    elif is_verb(tag):
        return wn.VERB
    return wn.NOUN

tags = nltk.pos_tag(word_tokenize(query))
for tag in tags:
    wn_tag = penn_to_wn(tag[1])
    print(tag[0] + "---> " + WordNetLemmatizer().lemmatize(tag[0], wn_tag))
Here, I tried to use the WordNet lemmatizer by providing the proper tags. Here is the output:
The---> The
Indian---> Indian
economy---> economy
is---> be
the---> the
worlds---> world
tenth---> tenth
largest---> large
by---> by
nominal---> nominal
GDP---> GDP
and---> and
third---> third
largest---> large
by---> by
purchasing---> purchase
power---> power
parity---> parity
Still, words like "arrival" and "conclusion" won't get processed with this approach. Is there any solution for this?
Try the word_forms package: clone it from here and do pip install -e word_forms.
from word_forms.word_forms import get_word_forms
get_word_forms('conclusion')
# gives:
{'a': {'conclusive'},
'n': {'conclusion', 'conclusions', 'conclusivenesses', 'conclusiveness'},
'r': {'conclusively'},
'v': {'concludes', 'concluded', 'concluding', 'conclude'}}
In your case, you'd like to get a verb form from a noun word form.
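Following the output above, you could just index the 'v' key of the returned dict (a tiny sketch against the word_forms API as shown):
from word_forms.word_forms import get_word_forms

verbs = get_word_forms('conclusion')['v']
print(verbs)  # e.g. {'concludes', 'concluded', 'concluding', 'conclude'} (set order varies)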
Ok, so... for the word "ate" I think you're looking for NodeBox::Linguistics.
print(en.verb.present("gave"))
# give
And I did not completely understand why you want the verb of "arrival" but not the one of "conclusion".