What do NN, VBD, IN, DT, NNS, RB mean in NLTK? - python

When I chunk text, I get lots of codes in the output, like
NN, VBD, IN, DT, NNS, RB.
Is there a list documented somewhere that tells me the meaning of these?
I have tried googling "nltk chunk code", "nltk chunk grammar" and "nltk chunk tokens",
but I am not able to find any documentation that explains what these codes mean.

The tags that you see are not a result of the chunks but the POS tagging that happens before chunking. It's the Penn Treebank tagset, see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> sent = "This is a Foo Bar sentence."
>>> # POS tag.
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('Foo', 'NNP'), ('Bar', 'NNP'), ('sentence', 'NN'), ('.', '.')]
>>> tagged_sent = pos_tag(word_tokenize(sent))
>>> # Chunk.
>>> ne_chunk(tagged_sent)
Tree('S', [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]), ('sentence', 'NN'), ('.', '.')])
To get the chunks look for subtrees within the chunked outputs. From the above output, the Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]) indicates the chunk.
This tutorial site is pretty helpful to explain the chunking process in NLTK: http://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/dongqing-chunking.pdf.
For official documentation, see http://www.nltk.org/howto/chunk.html
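As a rough sketch of "look for subtrees", the snippet below walks a tree and collects each chunk's label and words. The tree is a hand-built copy of the ne_chunk output above so that it runs without downloading any models, and extract_chunks is a hypothetical helper name, not part of NLTK.

```python
from nltk.tree import Tree

# Hand-built copy of the ne_chunk output shown above (no model download needed).
chunked = Tree('S', [
    ('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
    Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]),
    ('sentence', 'NN'), ('.', '.'),
])


def extract_chunks(tree):
    """Return (label, text) pairs for every subtree below the root."""
    chunks = []
    # subtrees() also yields the root itself, so filter it out.
    for subtree in tree.subtrees(filter=lambda t: t is not tree):
        words = [word for word, tag in subtree.leaves()]
        chunks.append((subtree.label(), ' '.join(words)))
    return chunks


print(extract_chunks(chunked))  # [('ORGANIZATION', 'Foo Bar')]
```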

The links above already cover the tags, but I hope this list is still helpful to someone. I have added a few labels that are missing from the other links.
CC: Coordinating conjunction
CD: Cardinal number
DT: Determiner
EX: Existential there
FW: Foreign word
IN: Preposition or subordinating conjunction
JJ: Adjective
VP: Verb Phrase
JJR: Adjective, comparative
JJS: Adjective, superlative
LS: List item marker
MD: Modal
NN: Noun, singular or mass
NNS: Noun, plural
PP: Preposition Phrase
NNP: Proper noun, singular
NNPS: Proper noun, plural
PDT: Predeterminer
POS: Possessive ending
PRP: Personal pronoun
PRP$: Possessive pronoun
RB: Adverb
RBR: Adverb, comparative
RBS: Adverb, superlative
RP: Particle
S: Simple declarative clause
SBAR: Clause introduced by a (possibly empty) subordinating conjunction
SBARQ: Direct question introduced by a wh-word or a wh-phrase.
SINV: Inverted declarative sentence, i.e. one in which the subject follows the tensed verb or modal.
SQ: Inverted yes/no question, or main clause of a wh-question, following the wh-phrase in SBARQ.
SYM: Symbol
TO: to
UH: Interjection
VB: Verb, base form
VBD: Verb, past tense
VBG: Verb, gerund or present participle
VBN: Verb, past participle
VBP: Verb, non-3rd person singular present
VBZ: Verb, 3rd person singular present
WDT: Wh-determiner
WP: Wh-pronoun
WP$: Possessive wh-pronoun
WRB: Wh-adverb
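If you want the descriptions next to your tagger output, a small lookup table is enough. This is only a sketch: PENN_TAGS holds just the six tags from the question and describe is a made-up helper name. If you have the NLTK "tagsets" data installed, nltk.help.upenn_tagset('NN') prints the official description instead.

```python
# Descriptions copied from the list above; extend with whatever tags you need.
PENN_TAGS = {
    'NN': 'Noun, singular or mass',
    'VBD': 'Verb, past tense',
    'IN': 'Preposition or subordinating conjunction',
    'DT': 'Determiner',
    'NNS': 'Noun, plural',
    'RB': 'Adverb',
}


def describe(tagged):
    """Attach a human-readable description to each (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, 'unknown tag')) for word, tag in tagged]


for row in describe([('dogs', 'NNS'), ('barked', 'VBD'), ('loudly', 'RB')]):
    print(row)
```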

As alvas said above, these tags are part-of-speech tags: they tell you whether a word is a noun, adverb, determiner, verb, etc.
Here are the POS tag details you can refer to.
Chunking recovers phrases from the part-of-speech tags.
You can refer to this link to read more about chunking.

Related

Chunking - regular expressions and trees

I'm a total noob so sorry if I'm asking something obvious. My question is twofold, or rather it's two questions in the same topic:
I'm studying nltk at uni, and we're doing chunks. In my notes I have the following grammar:
grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN.*>+} # noun phrase
PP: {<IN><NP>} # prepositional phrase
VP: {<MD>?<VB.*><NP|PP>} # verb phrase
CLAUSE: {<NP><VP>} # full clause
"""
What is the "$" symbol for in this case? I know it's "end of the line" in regex, but what does it stand for here?
Also, in my text book there's a Tree that's been printed without using the .draw() function, to this result:
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
How the heck does one do that?
Thanks in advance to anybody who'll have the patience to school this noob :D
This is the code of your example:
import nltk
sentence = [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')]
grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN.*>+} # noun phrase
PP: {<IN><NP>} # prepositional phrase
VP: {<MD>?<VB.*><NP|PP>} # verb phrase
CLAUSE: {<NP><VP>} # full clause
"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
#output
#(S(CLAUSE (NP the/DT book/NN) (VP has/VBZ (NP many/JJ chapters/NNS))))
result.draw()
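The textbook output is simply the repr() of a tree. print(repr(tree)), or evaluating the tree at the interactive prompt, shows it in this form:
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])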
I found this link where you can learn a lot.
The $ symbol is a special character in regular expressions, and must
be backslash escaped in order to match the tag PP$.
Example of an unescaped $:
xyz$ -> matches the pattern xyz at the end of a string
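To see the difference, here is a small check with Python's plain re module. It only illustrates the escaping rule; how RegexpParser translates tag patterns internally is not shown here.

```python
import re

# Unescaped: "$" is an anchor, so this pattern means "PP at the end of the string".
anchored = re.compile(r"PP$")
# Escaped: "\$" is a literal dollar sign, so this pattern matches the tag "PP$" itself.
literal = re.compile(r"PP\$")

print(bool(anchored.fullmatch("PP$")))  # False: the "$" in the tag is never consumed
print(bool(literal.fullmatch("PP$")))   # True
print(bool(anchored.fullmatch("PP")))   # True: plain "PP" followed by end of string
```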

Python - How to extract chunked elements?

I am new to python and I'm giving nltk a shot. I came across the following:
namedEnt = nltk.ne_chunk(tagged)
Where tagged is
tagged = nltk.pos_tag(words)
and words are the tokens of a sentence.
I would like to remove the stop words from namedEnt. I was able to first remove the stop words from the tokens and then chunk, but I was not able to chunk first and then remove the stop words. Is it possible? If so, how could I do this?
E.g. the sentence "get me todays menu." is tagged as
('get', 'VB')
('me', 'PRP')
('todays', 'JJ')
('menu', 'NN')
('.', '.')
and I would like to get
('get','VB')('todays','JJ')('menu','NN')('.','.')
I'm not sure I understand what you want (if you feed filtered from the example below to nltk.ne_chunk, it will probably produce incorrect output, because I guess ne_chunk expects a grammatical sentence), but your input/output can be achieved in the following way:
>>> import nltk
>>> words = ['Get', 'me', 'todays', 'menu', '.']
>>> tagged = nltk.pos_tag(words)
>>> tagged
[('Get', 'NN'), ('me', 'PRP'), ('todays', 'VBZ'), ('menu', 'NN'), ('.', '.')]
>>> stop = set(nltk.corpus.stopwords.words('english'))  # build the set once, not per token
>>> filtered = [t for t in tagged if t[0] not in stop]
>>> filtered
[('Get', 'NN'), ('todays', 'VBZ'), ('menu', 'NN'), ('.', '.')]
Again, not sure how useful this actually is. NamedEnt in your example returns a tree structure, so you'd have to traverse this tree to extract whichever you are interested in, which probably means skipping over stopwords anyway...
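If you do want to filter after chunking, one way is to walk the top level of the tree, keep any chunk subtrees whole and drop stopword leaves. This is only a sketch under some assumptions: STOPWORDS is a tiny made-up list standing in for nltk.corpus.stopwords.words('english'), drop_stopwords is a hypothetical helper, and the tree is built by hand instead of calling ne_chunk.

```python
from nltk.tree import Tree

# Tiny stand-in list; the real one is nltk.corpus.stopwords.words('english').
STOPWORDS = {'me', 'the', 'a', 'is'}


def drop_stopwords(tree, stopwords=STOPWORDS):
    """Return a copy of the tree without top-level stopword leaves."""
    kept = []
    for node in tree:
        if isinstance(node, Tree):
            kept.append(node)  # keep named-entity chunks intact
        elif node[0].lower() not in stopwords:
            kept.append(node)  # node is a (word, tag) leaf
    return Tree(tree.label(), kept)


# Hand-built stand-in for the ne_chunk output of the example sentence.
named_ent = Tree('S', [('get', 'VB'), ('me', 'PRP'), ('todays', 'JJ'),
                       ('menu', 'NN'), ('.', '.')])
print(drop_stopwords(named_ent).leaves())
# [('get', 'VB'), ('todays', 'JJ'), ('menu', 'NN'), ('.', '.')]
```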

Identify the word as a noun, verb or adjective

Given a single word such as "table", I want to identify what it is most commonly used as, whether its most common usage is noun, verb or adjective. I want to do this in python. Is there anything else besides wordnet too? I don't prefer wordnet. Or, if I use wordnet, how would I do it exactly with it?
import nltk
text = 'This is a table. We should table this offer. The table is in the center.'
text = nltk.word_tokenize(text)
result = nltk.pos_tag(text)
result = [i for i in result if i[0].lower() == 'table']
print(result) # [('table', 'JJ'), ('table', 'VB'), ('table', 'NN')]
If you have a word out of context and want to know its most common use, you could look at someone else's frequency table (e.g. WordNet), or you can do your own counts: Just find a tagged corpus that's large enough for your purposes, and count its instances. If you want to use a free corpus, the NLTK includes the Brown corpus (1 million words). The NLTK also provides methods for working with larger, non-free corpora (e.g, the British National Corpus).
import nltk
from nltk.corpus import brown
table = nltk.FreqDist(t for w, t in brown.tagged_words() if w.lower() == 'table')
print(table.most_common())
[('NN', 147), ('NN-TL', 50), ('VB', 1)]
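Note that Brown tags carry suffixes such as -TL, so NN and NN-TL are counted separately above. If you only care about noun vs. verb vs. adjective, you can collapse them first. A rough sketch, assuming the part before the first hyphen is the base tag (this does not hold for every Brown tag, e.g. ones with a prefix like FW-); the counts are copied from the output above:

```python
from collections import Counter

# Counts copied from the Brown corpus output above.
brown_counts = [('NN', 147), ('NN-TL', 50), ('VB', 1)]


def collapse(tag):
    """Assumption: the text before the first hyphen is the base tag."""
    return tag.split('-')[0]


coarse = Counter()
for tag, count in brown_counts:
    coarse[collapse(tag)] += count

print(coarse.most_common(1))  # [('NN', 197)]
```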

NLTK convert tokenized sentence to synset format

I'm looking to get the similarity between a single word and each word in a sentence using NLTK.
NLTK can get the similarity between two specific words as shown below. This method requires that a specific reference to the word is given, in this case it is 'dog.n.01' where dog is a noun and we want to use the first (01) NLTK definition.
from nltk.corpus import wordnet
dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
print(dog.path_similarity(cat))
>> 0.2
The problem is that I need to get the part of speech information from each word in the sentence. The NLTK package has the ability to get the parts of speech for each word in a sentence as shown below. However, these speech parts ('NN', 'VB', 'PRP'...) don't match up with the format that the synset takes as a parameter.
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
pos_tag(text)
>> [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
Is it possible to get the synset formatted data from pos_tag() results in NLTK? By synset formatted I mean the format like dog.n.01
You can use a simple conversion function:
from nltk.corpus import wordnet as wn
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None
After tagging a sentence you can tie each word in the sentence to a synset using this function. Here's an example:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
sentence = "I am going to buy some gifts"
tagged = pos_tag(word_tokenize(sentence))
synsets = []
lemmatzr = WordNetLemmatizer()
for token in tagged:
    wn_tag = penn_to_wn(token[1])
    if not wn_tag:
        continue
    lemma = lemmatzr.lemmatize(token[0], pos=wn_tag)
    candidates = wn.synsets(lemma, pos=wn_tag)
    if candidates:  # skip words WordNet does not know
        synsets.append(candidates[0])
print(synsets)
Result: [Synset('be.v.01'), Synset('travel.v.01'), Synset('buy.v.01'), Synset('gift.n.01')]
You can use wordnet.synsets, which takes the part of speech as a separate argument and returns a list of matching synsets:
wordnet.synsets('dog', pos=wordnet.NOUN)
You'll still need to translate the tags offered by pos_tag into those supported by wordnet.synsets -- unfortunately, I don't know of a pre-built dictionary doing that, so (unless I'm missing the existence of such a correspondence table) you'll need to build your own (you can do that once and pickle it for subsequent reloading).
See http://www.nltk.org/book/ch05.html , subchapter 1, on how to get help about a specific tagset -- e.g nltk.help.upenn_tagset('N.*') will confirm that the UPenn tagset (which I believe is the default one used by pos_tag) uses 'N' followed by something to identify variants of what synset will see as a wordnet.NOUN.
I have not tried http://www.nltk.org/_modules/nltk/tag/mapping.html but it might be just what you require -- give it a try!
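For the "dog.n.01" format specifically, a hand-built table like the one suggested above could look roughly like this. It is a sketch with several assumptions: the letters 'a', 'n', 'r', 'v' are the values of wn.ADJ, wn.NOUN, wn.ADV and wn.VERB, synset_name is a made-up helper, and sense 01 is simply assumed to be the one you want. The resulting name is not guaranteed to exist in WordNet, and inflected words should be lemmatized first.

```python
# First letter of the Penn tag -> WordNet POS letter.
PENN_PREFIX_TO_WN = {'J': 'a', 'N': 'n', 'R': 'r', 'V': 'v'}


def synset_name(word, penn_tag, sense=1):
    """Build a name like 'dog.n.01', or None if the tag has no WordNet POS."""
    pos = PENN_PREFIX_TO_WN.get(penn_tag[:1])
    if pos is None:
        return None
    return '{}.{}.{:02d}'.format(word.lower(), pos, sense)


tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('the', 'DT'), ('permit', 'NN')]
print([synset_name(word, tag) for word, tag in tagged])
# [None, 'refuse.v.01', None, 'permit.n.01']
```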

custom tagging with nltk

I'm trying to create a small English-like language for specifying tasks. The basic idea is to split a statement into verbs and the noun phrases that those verbs should apply to. I'm working with nltk but not getting the results I'd hoped for, e.g.:
>>> nltk.pos_tag(nltk.word_tokenize("select the files and copy to harddrive'"))
[('select', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('and', 'CC'), ('copy', 'VB'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("move the files to harddrive'"))
[('move', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("copy the files to harddrive'"))
[('copy', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
In each case it has failed to realise the first word (select, move and copy) were intended as verbs. I know I can create custom taggers and grammars to work around this but at the same time I'm hesitant to go reinventing the wheel when a lot of this stuff is out of my league. I particularly would prefer a solution where non-English languages could be handled as well.
So anyway, my question is one of:
Is there a better tagger for this type of grammar?
Is there a way to weight an existing tagger towards using the verb form more frequently than the noun form?
Is there a way to train a tagger?
Is there a better way altogether?
One solution is to create a manual UnigramTagger that backs off to the NLTK tagger. Something like this:
>>> import nltk.tag, nltk.data
>>> default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)
Then you get
>>> tagger.tag(['select', 'the', 'files'])
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')]
This same method can work for non-English languages, as long as you have an appropriate default tagger. You can train your own taggers using train_tagger.py from nltk-trainer and an appropriate corpus.
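If you just want to try the override idea without a pretrained model, here is a minimal sketch. DefaultTagger('NN') is only a stand-in backoff chosen so the snippet runs without any downloads; in practice you would pass your real tagger there. The overrides dictionary is made up from the verbs in the question.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Stand-in backoff that tags everything as NN; replace with a real tagger.
backoff = DefaultTagger('NN')

# Words that should always be tagged as verbs in this command language.
overrides = {'select': 'VB', 'move': 'VB', 'copy': 'VB'}
tagger = UnigramTagger(model=overrides, backoff=backoff)

print(tagger.tag(['copy', 'the', 'files']))
# [('copy', 'VB'), ('the', 'NN'), ('files', 'NN')]
```

The words in the model get the forced tag, and everything else falls through to the backoff tagger.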
Jacob's answer is spot on. However, to expand upon it, you may find you need more than just unigrams.
For example, consider the three sentences:
select the files
use the select function on the sockets
the select was good
Here, the word "select" is being used as a verb, adjective, and noun respectively. A unigram tagger won't be able to model this. Even a bigram tagger can't handle it, because two of the cases share the same preceding word (i.e. "the"). You'd need a trigram tagger to handle this case correctly.
import nltk.tag, nltk.data
from nltk import word_tokenize
default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
def evaluate(tagger, sentences):
    good, total = 0, 0
    for sentence, func in sentences:
        tags = tagger.tag(word_tokenize(sentence))
        print(tags)
        good += func(tags)  # each check returns True/False
        total += 1
    print('Accuracy:', good / total)
sentences = [
('select the files', lambda tags: ('select', 'VB') in tags),
('use the select function on the sockets', lambda tags: ('select', 'JJ') in tags and ('use', 'VB') in tags),
('the select was good', lambda tags: ('select', 'NN') in tags),
]
train_sents = [
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
[('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN'), ('on', 'IN'), ('the', 'DT'), ('sockets', 'NNS')],
[('the', 'DT'), ('select', 'NN'), ('files', 'NNS')],
]
tagger = nltk.TrigramTagger(train_sents, backoff=default_tagger)
evaluate(tagger, sentences)
#model = tagger._context_to_tag
Note, you can use NLTK's NgramTagger to train a tagger using an arbitrarily high number of n-grams, but typically you don't get much performance increase after trigrams.
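A common way to combine these is a backoff chain, where each n-gram tagger falls back to the next simpler one. The sketch below is my own illustration rather than part of the answer above: it trains on three toy sentences and uses DefaultTagger('NN') as the last resort instead of the pretrained tagger, so it runs without downloads.

```python
from nltk.tag import BigramTagger, DefaultTagger, TrigramTagger, UnigramTagger

# Toy training data; a real tagger needs a proper tagged corpus.
train_sents = [
    [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
    [('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN'),
     ('on', 'IN'), ('the', 'DT'), ('sockets', 'NNS')],
    [('the', 'DT'), ('select', 'NN'), ('files', 'NNS')],
]

# Build the chain default -> unigram -> bigram -> trigram.
tagger = DefaultTagger('NN')
for tagger_class in (UnigramTagger, BigramTagger, TrigramTagger):
    tagger = tagger_class(train_sents, backoff=tagger)

print(tagger.tag(['select', 'the', 'files']))
# [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')]
```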
See Jacob's answer.
In later versions (at least nltk 3.2) nltk.tag._POS_TAGGER does not exist. The default taggers are usually downloaded into the nltk_data/taggers/ directory, e.g.:
>>> import nltk
>>> nltk.download('maxent_treebank_pos_tagger')
Usage is as follows.
>>> import nltk.tag, nltk.data
>>> tagger_path = '/path/to/nltk_data/taggers/maxent_treebank_pos_tagger/english.pickle'
>>> default_tagger = nltk.data.load(tagger_path)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)
See also: How to do POS tagging using the NLTK POS tagger in Python.
Bud's answer is correct. Also, according to this link,
if your nltk_data packages were correctly installed, then NLTK knows where they are on your system, and you don't need to pass an absolute path.
Meaning, you can just say
tagger_path = 'taggers/maxent_treebank_pos_tagger/english.pickle'
default_tagger = nltk.data.load(tagger_path)
