NLTK - Lemmatizing the tokens before being chunked - python

I am currently stuck on this problem.
NLTK's chunking pipeline looks like this:
tokens = nltk.word_tokenize(word)
tagged = nltk.pos_tag(tokens)
chunking = nltk.chunk.ne_chunk(tagged)
Is there any way to lemmatize the tokens, together with their tags, before chunking? Something like
lmtzr.lemmatize(tokens, pos=tagged)
I have tried to lemmatize the chunk, but it is not working (the error says something about the chunk being a list). I am new to Python, so my knowledge of it isn't that great. Any help would be appreciated!

You can lemmatize the tokens directly, without a POS tag, and then tag and chunk the lemmatized tokens:
import nltk
from nltk.corpus import wordnet
lmtzr = nltk.WordNetLemmatizer()
word = "Here are words and cars"
tokens = nltk.word_tokenize(word)
token_lemma = [ lmtzr.lemmatize(token) for token in tokens ]
tagged = nltk.pos_tag(token_lemma)
chunking = nltk.chunk.ne_chunk(tagged)
Output
['Here', 'are', 'word', 'and', 'car'] # lemmatize output
[('Here', 'RB'), ('are', 'VBP'), ('word', 'NN'), ('and', 'CC'), ('car', 'NN')]
(S Here/RB are/VBP word/NN and/CC car/NN)
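If you do want to use the POS tag when lemmatizing (as in the question), you can map each Penn Treebank tag to a WordNet POS first and then chunk the result. A minimal sketch; the penn_to_wordnet helper below is my own illustrative mapping, not an NLTK function:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet POS constant; fall back to noun.
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lmtzr = WordNetLemmatizer()
tokens = nltk.word_tokenize("Here are words and cars")
tagged = nltk.pos_tag(tokens)
# Lemmatize each token using its mapped tag, keeping the original tag for chunking.
lemmatized = [(lmtzr.lemmatize(token, pos=penn_to_wordnet(tag)), tag) for token, tag in tagged]
chunking = nltk.chunk.ne_chunk(lemmatized)
print(chunking)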

Related

Why is the stem of the word "got" still "got" instead of "get"?

from stemming.porter2 import stem
documents = ['got',"get"]
documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]
print(documents)
The result is:
[['got'], ['get']]
Can someone help explain this?
Thank you!
What you want is a lemmatizer instead of a stemmer. The difference is subtle.
Generally, a stemmer drops suffixes as aggressively as possible and, in some cases, handles an exception list for words whose normalized form cannot be found by simply dropping suffixes.
A lemmatizer tries to find the "basic"/root/infinitive form of a word and usually, it requires specialized rules for different languages.
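For example, here is a quick contrast between NLTK's Porter stemmer and the WordNet lemmatizer; note the lemmatizer only maps "got" to "get" when told the word is a verb:
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem('got'))                   # 'got' -- no suffix rule applies
print(lemmatizer.lemmatize('got'))           # 'got' -- POS defaults to noun
print(lemmatizer.lemmatize('got', pos='v'))  # 'get' -- irregular verb handled by WordNet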
See
what is the true difference between lemmatization vs stemming?
Stemmers vs Lemmatizers
Lemmatization using the NLTK implementation of the morphy lemmatizer requires the correct part-of-speech (POS) tag to be fairly accurate.
Avoid (in fact, never try) lemmatizing individual words in isolation. Instead, lemmatize a fully POS-tagged sentence, e.g.:
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn

def penn2morphy(penntag, returnNone=False, default_to_noun=False):
    # Convert a Penn Treebank tag to a morphy (WordNet) POS tag.
    morphy_tag = {'NN': wn.NOUN, 'JJ': wn.ADJ,
                  'VB': wn.VERB, 'RB': wn.ADV}
    try:
        return morphy_tag[penntag[:2]]
    except KeyError:
        if returnNone:
            return None
        elif default_to_noun:
            return 'n'
        else:
            return ''
With the penn2morphy helper function, you can convert the POS tags from pos_tag() to morphy tags and then:
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> sent = "He got up in bed at 8am."
>>> [(token, penn2morphy(tag)) for token, tag in pos_tag(word_tokenize(sent))]
[('He', ''), ('got', 'v'), ('up', ''), ('in', ''), ('bed', 'n'), ('at', ''), ('8am', ''), ('.', '')]
>>> [wnl.lemmatize(token, pos=penn2morphy(tag, default_to_noun=True)) for token, tag in pos_tag(word_tokenize(sent))]
['He', 'get', 'up', 'in', 'bed', 'at', '8am', '.']
For convenience you can also try the pywsd lemmatizer.
>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 7.196984529495239 secs.
>>> sent = "He got up in bed at 8am."
>>> lemmatize_sentence(sent)
['he', 'get', 'up', 'in', 'bed', 'at', '8am', '.']
See also https://stackoverflow.com/a/22343640/610569

Python - How to extract chunked elements?

I am new to python and I'm giving nltk a shot. I came across the following:
namedEnt = nltk.ne_chunk(tagged)
Where tagged is
tagged = nltk.pos_tag(words)
and words are the tokens of a sentence.
I would like to remove the stop words from namedEnt. I was able to remove the stop words from the tokens first and then chunk, but I was not able to chunk first and then remove the stop words. Is that possible? If so, how could I do it?
E.g. the sentence "get me todays menu." is tagged as
('get', 'VB')
('me', 'PRP')
('todays', 'JJ')
('menu', 'NN')
('.', '.')
and I would like to get
('get','VB')('todays','JJ')('menu','NN')('.','.')
Not sure if I get what you want (if you feed filtered from the example below to nltk.ne_chunk, it will probably produce incorrect output, because ne_chunk expects a grammatical sentence), but your desired input/output can be achieved in the following way:
>>> import nltk
>>> words = ['Get', 'me', 'todays', 'menu', '.']
>>> tagged = nltk.pos_tag(words)
>>> tagged
[('Get', 'NN'), ('me', 'PRP'), ('todays', 'VBZ'), ('menu', 'NN'), ('.', '.')]
>>> filtered = [t for t in tagged if not t[0] in set(nltk.corpus.stopwords.words('english'))]
>>> filtered
[('Get', 'NN'), ('todays', 'VBZ'), ('menu', 'NN'), ('.', '.')]
Again, not sure how useful this actually is. namedEnt in your example returns a tree structure, so you'd have to traverse that tree to extract whatever you are interested in, which probably means skipping over stopwords anyway...
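If you do want to chunk first and drop stopwords afterwards, one option is to walk the tree that ne_chunk returns, keeping named-entity subtrees and filtering the plain (word, tag) leaves. A rough sketch, assuming the NLTK stopwords corpus is downloaded; filter_tree is just an illustrative helper name:
import nltk
from nltk.corpus import stopwords

def filter_tree(named_ent, stop_words):
    # Keep the leaves of named-entity subtrees as-is and drop plain
    # (word, tag) leaves whose word is a stopword.
    kept = []
    for node in named_ent:
        if isinstance(node, nltk.Tree):
            kept.extend(node.leaves())
        elif node[0].lower() not in stop_words:
            kept.append(node)
    return kept

words = nltk.word_tokenize("get me todays menu.")
named_ent = nltk.ne_chunk(nltk.pos_tag(words))
print(filter_tree(named_ent, set(stopwords.words('english'))))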

How to do POS tagging on Telugu text?

I have been doing English POS tagging for a long time. It's straightforward, like:
>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
But I want to do it in Telugu.
I followed this article - http://jaganadhg.freeflux.net/blog/archive/2009/10/12/nltk-and-indian-language-corpus-processing-part-ii.html
And I could test a few of the built-in sentences.
But I could not figure out how to tag arbitrary Telugu text. Could someone with experience using NLTK for non-English text please guide me?
In telugu.pos I have 9999 words and 1197 sentences in total.
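One approach, sketched under the assumption that the goal is to train on NLTK's bundled telugu.pos file from the indian corpus (the fallback tag 'NN' and the held-out slice are arbitrary illustrative choices, and accuracy will be limited by that small corpus):
import nltk
from nltk.corpus import indian
from nltk.tag import UnigramTagger, DefaultTagger

# Train on the bundled Telugu data, holding out a few sentences for a sanity check.
tagged_sents = list(indian.tagged_sents('telugu.pos'))
train_sents, test_sents = tagged_sents[:-100], tagged_sents[-100:]
tagger = UnigramTagger(train_sents, backoff=DefaultTagger('NN'))
print(tagger.evaluate(test_sents))

# Tag raw (untagged) Telugu tokens; here the words of a held-out sentence
# stand in for "any random Telugu text" split into tokens.
tokens = [word for word, tag in test_sents[0]]
print(tagger.tag(tokens))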

NLTK convert tokenized sentence to synset format

I'm looking to get the similarity between a single word and each word in a sentence using NLTK.
NLTK can get the similarity between two specific words, as shown below. This method requires that a specific reference to the word is given; in this case it is 'dog.n.01', where dog is a noun and we want to use the first (01) WordNet sense.
dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
print dog.path_similarity(cat)
>> 0.2
The problem is that I need the part-of-speech information for each word in the sentence. The NLTK package can get the part of speech for each word in a sentence, as shown below. However, these tags ('NN', 'VB', 'PRP'...) don't match the format that the synset takes as a parameter.
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
pos_tag(text)
>> [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
Is it possible to get synset-formatted data from the pos_tag() results in NLTK? By synset-formatted I mean a format like dog.n.01.
You can use a simple conversion function:
from nltk.corpus import wordnet as wn
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None
After tagging a sentence, you can tie each word in the sentence to a synset using this function. Here's an example:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
sentence = "I am going to buy some gifts"
tagged = pos_tag(word_tokenize(sentence))
synsets = []
lemmatzr = WordNetLemmatizer()
for token in tagged:
    wn_tag = penn_to_wn(token[1])
    if not wn_tag:
        continue
    lemma = lemmatzr.lemmatize(token[0], pos=wn_tag)
    # Note: this takes the first synset and will raise an IndexError
    # if the lemma has no synsets for that POS.
    synsets.append(wn.synsets(lemma, pos=wn_tag)[0])
print synsets
Result: [Synset('be.v.01'), Synset('travel.v.01'), Synset('buy.v.01'), Synset('gift.n.01')]
You can use the alternative form, wordnet.synsets (plural), which takes the word and a POS and returns the matching synsets:
wordnet.synsets('dog', pos=wordnet.NOUN)
You'll still need to translate the tags offered by pos_tag into those supported by wordnet.synsets -- unfortunately, I don't know of a pre-built dictionary doing that, so (unless I'm missing the existence of such a correspondence table) you'll need to build your own (you can do that once and pickle it for subsequent reloading).
See http://www.nltk.org/book/ch05.html , subchapter 1, on how to get help about a specific tagset -- e.g. nltk.help.upenn_tagset('N.*') will confirm that the UPenn tagset (which I believe is the default one used by pos_tag) uses 'N' followed by something to identify variants of what wordnet.synsets will see as wordnet.NOUN.
I have not tried http://www.nltk.org/_modules/nltk/tag/mapping.html but it might be just what you require -- give it a try!
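The mapping module does ship a Penn-to-universal tagset conversion (map_tag), which gets you partway there; you still have to map the universal tags onto WordNet POS constants yourself. A sketch, where the small universal_to_wn dictionary is my own and not part of NLTK:
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn
from nltk.tag.mapping import map_tag

# Illustrative mapping from the universal tagset to WordNet POS constants.
universal_to_wn = {'NOUN': wn.NOUN, 'VERB': wn.VERB, 'ADJ': wn.ADJ, 'ADV': wn.ADV}

for token, tag in pos_tag(word_tokenize("They refuse to permit us to obtain the refuse permit")):
    universal = map_tag('en-ptb', 'universal', tag)  # e.g. 'VBP' -> 'VERB'
    wn_pos = universal_to_wn.get(universal)
    if wn_pos is None:
        continue
    synsets = wn.synsets(token, pos=wn_pos)
    if synsets:
        print(token, synsets[0].name())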

custom tagging with nltk

I'm trying to create a small English-like language for specifying tasks. The basic idea is to split a statement into verbs and noun phrases that those verbs should apply to. I'm working with NLTK but not getting the results I'd hoped for, e.g.:
>>> nltk.pos_tag(nltk.word_tokenize("select the files and copy to harddrive'"))
[('select', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('and', 'CC'), ('copy', 'VB'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("move the files to harddrive'"))
[('move', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("copy the files to harddrive'"))
[('copy', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
In each case it has failed to realise that the first word (select, move and copy) was intended as a verb. I know I can create custom taggers and grammars to work around this, but at the same time I'm hesitant to go reinventing the wheel when a lot of this stuff is out of my league. I would particularly prefer a solution where non-English languages could be handled as well.
So anyway, my question is one of:
Is there a better tagger for this type of grammar?
Is there a way to weight an existing tagger towards using the verb form more frequently than the noun form?
Is there a way to train a tagger?
Is there a better way altogether?
One solution is to create a manual UnigramTagger that backs off to the NLTK tagger. Something like this:
>>> import nltk.tag, nltk.data
>>> default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)
Then you get
>>> tagger.tag(['select', 'the', 'files'])
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')]
This same method can work for non-English languages, as long as you have an appropriate default tagger. You can train your own taggers using train_tagger.py from nltk-trainer and an appropriate corpus.
Jacob's answer is spot on. However, to expand upon it, you may find you need more than just unigrams.
For example, consider the three sentences:
select the files
use the select function on the sockets
the select was good
Here, the word "select" is being used as a verb, an adjective, and a noun, respectively. A unigram tagger won't be able to model this. Even a bigram tagger can't handle it, because two of the cases share the same preceding word (i.e. "the"). You'd need a trigram tagger to handle such cases correctly.
import nltk.tag, nltk.data
from nltk import word_tokenize
default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)

def evaluate(tagger, sentences):
    good, total = 0, 0.
    for sentence, func in sentences:
        tags = tagger.tag(nltk.word_tokenize(sentence))
        print tags
        good += func(tags)
        total += 1
    print 'Accuracy:', good / total

sentences = [
    ('select the files', lambda tags: ('select', 'VB') in tags),
    ('use the select function on the sockets', lambda tags: ('select', 'JJ') in tags and ('use', 'VB') in tags),
    ('the select was good', lambda tags: ('select', 'NN') in tags),
]

train_sents = [
    [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
    [('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN'), ('on', 'IN'), ('the', 'DT'), ('sockets', 'NNS')],
    [('the', 'DT'), ('select', 'NN'), ('files', 'NNS')],
]

tagger = nltk.TrigramTagger(train_sents, backoff=default_tagger)
evaluate(tagger, sentences)
#model = tagger._context_to_tag
Note, you can use NLTK's NgramTagger to train a tagger using an arbitrarily high number of n-grams, but typically you don't get much performance increase after trigrams.
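For instance, a higher-order tagger chained through backoffs might look like the sketch below (this assumes the treebank corpus has been downloaded; the training slice and the 4-gram order are arbitrary choices for illustration):
from nltk import word_tokenize
from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, NgramTagger

train_sents = treebank.tagged_sents()[:3000]
# Chain taggers so each order backs off to the next lower one.
t0 = DefaultTagger('NN')
t1 = UnigramTagger(train_sents, backoff=t0)
t2 = BigramTagger(train_sents, backoff=t1)
t4 = NgramTagger(4, train_sents, backoff=t2)
print(t4.tag(word_tokenize("use the select function on the sockets")))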
See Jacob's answer.
In later versions (at least nltk 3.2) nltk.tag._POS_TAGGER does not exist. The default taggers are usually downloaded into the nltk_data/taggers/ directory, e.g.:
>>> import nltk
>>> nltk.download('maxent_treebank_pos_tagger')
Usage is as follows.
>>> import nltk.tag, nltk.data
>>> tagger_path = '/path/to/nltk_data/taggers/maxent_treebank_pos_tagger/english.pickle'
>>> default_tagger = nltk.data.load(tagger_path)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)
See also: How to do POS tagging using the NLTK POS tagger in Python.
Bud's answer is correct. Also, according to this link,
if your nltk_data packages were correctly installed, then NLTK knows where they are on your system, and you don't need to pass an absolute path.
Meaning, you can just say
tagger_path = 'taggers/maxent_treebank_pos_tagger/english.pickle'
default_tagger = nltk.data.load(tagger_path)
