Chunking - regular expressions and trees - python

I'm a total noob so sorry if I'm asking something obvious. My question is twofold, or rather it's two questions on the same topic:
I'm studying nltk at uni, and we're doing chunks. In my notes I have the following grammar:
grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN.*>+} # noun phrase
PP: {<IN><NP>} # prepositional phrase
VP: {<MD>?<VB.*><NP|PP>} # verb phrase
CLAUSE: {<NP><VP>} # full clause
"""
What is the "$" symbol for in this case? I know it's "end of the line" in regex, but what does it stand for here?
Also, in my text book there's a Tree that's been printed without using the .draw() function, to this result:
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
How the heck does one do that?
Thanks in advance to anybody who'll have the patience to school this noob :D

This is the code for your example:
import nltk
sentence = [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')]
grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN.*>+} # noun phrase
PP: {<IN><NP>} # prepositional phrase
VP: {<MD>?<VB.*><NP|PP>} # verb phrase
CLAUSE: {<NP><VP>} # full clause
"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
#output
# (S (CLAUSE (NP the/DT book/NN) (VP has/VBZ (NP many/JJ chapters/NNS))))
result.draw()
The Tree(...) form shown in your textbook is simply the tree's repr:
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
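To answer the second question: you get the Tree(...) output by displaying the result object itself (its repr) instead of print-ing it. A minimal sketch; note that with only the NP rule the parse reproduces the textbook tree exactly, since 'has' is left outside any chunk:

```python
import nltk

sentence = [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'),
            ('many', 'JJ'), ('chapters', 'NNS')]

# Only the NP rule from the grammar above.
cp = nltk.RegexpParser(r"NP: {<DT|PP\$>?<JJ>*<NN.*>+}")
result = cp.parse(sentence)

# print(result) shows the bracketed form; repr(result) shows Tree(...)
print(repr(result))
# Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'),
#            Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
```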
I found this link where you can learn a lot.
The $ symbol is a special character in regular expressions, so it must be backslash-escaped (\$) in order to match the literal dollar sign in the tag PP$ (a possessive pronoun such as "my" or "your").
For comparison, in an ordinary regex:
xyz$ matches the pattern xyz at the end of a string
xyz\$ matches the literal text "xyz$" anywhere in the string
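A quick sketch with Python's re module showing why the escape matters (the tag string here is just an illustration):

```python
import re

tags = "<PP$><JJ><NN>"

# Unescaped, $ is the end-of-string anchor, so this pattern can never
# match: nothing can follow the anchor inside the string.
print(re.search(r"<PP$>", tags))           # None

# Escaped, \$ matches a literal dollar sign, so the PP$ tag is found.
print(re.search(r"<PP\$>", tags).group())  # <PP$>
```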

Related

Python - How to extract chunked elements?

I am new to Python and I'm giving nltk a shot. I came across the following:
namedEnt = nltk.ne_chunk(tagged)
Where tagged is
tagged = nltk.pos_tag(words)
and words are tokens of a sentence.
I would like to remove the stop words from namedEnt. I was able to first remove the stop words from the tokens and then chunk, but I was not able to chunk first and then remove the stop words. Is it possible? If so, how could I do this?
E.g. the sentence "get me todays menu." is tagged as
('get', 'VB')
('me', 'PRP')
('todays', 'JJ')
('menu', 'NN')
('.', '.')
and I would like to get
('get','VB')('todays','JJ')('menu','NN')('.','.')
Not sure if I get what you want (if you feed filtered from the example below to nltk.ne_chunk, it will probably produce incorrect output, since ne_chunk expects a grammatical sentence, I guess), but your input/output can be achieved in the following way:
>>> import nltk
>>> words = ['Get', 'me', 'todays', 'menu', '.']
>>> tagged = nltk.pos_tag(words)
>>> tagged
[('Get', 'NN'), ('me', 'PRP'), ('todays', 'VBZ'), ('menu', 'NN'), ('.', '.')]
>>> filtered = [t for t in tagged if not t[0] in set(nltk.corpus.stopwords.words('english'))]
>>> filtered
[('Get', 'NN'), ('todays', 'VBZ'), ('menu', 'NN'), ('.', '.')]
Again, I'm not sure how useful this actually is. namedEnt in your example returns a tree structure, so you'd have to traverse that tree to extract whatever you are interested in, which probably means skipping over stopwords anyway...
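If you do want to chunk first and remove stopwords afterwards, one way is to walk the tree and rebuild it without stopword leaves. A sketch using a hand-built tree and a tiny hardcoded stopword set in place of nltk.corpus.stopwords, so nothing needs to be downloaded:

```python
from nltk import Tree

# A toy chunk tree of the kind ne_chunk returns, built by hand here.
tree = Tree('S', [
    ('get', 'VB'), ('me', 'PRP'),
    Tree('NP', [('todays', 'JJ'), ('menu', 'NN')]),
    ('.', '.'),
])

STOPWORDS = {'me', 'the', 'a'}  # stand-in for stopwords.words('english')

def strip_stopwords(t):
    """Rebuild the tree, dropping leaves whose word is a stopword."""
    children = []
    for child in t:
        if isinstance(child, Tree):
            children.append(strip_stopwords(child))
        elif child[0].lower() not in STOPWORDS:
            children.append(child)
    return Tree(t.label(), children)

print(strip_stopwords(tree).leaves())
# [('get', 'VB'), ('todays', 'JJ'), ('menu', 'NN'), ('.', '.')]
```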

How to do POS tagging on Telugu text?

I have been doing English POS tagging for a long time. It's straightforward, like
>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
But I want to do it in Telugu.
I followed this article - http://jaganadhg.freeflux.net/blog/archive/2009/10/12/nltk-and-indian-language-corpus-processing-part-ii.html
And I could test a few of the built-in sentences.
But I could not figure out how to tag arbitrary Telugu text. Could someone with experience using NLTK for non-English text please guide me?
The telugu.pos corpus I have contains 9999 words in total, in 1197 sentences.
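One way to tag arbitrary Telugu text is to train your own tagger on the tagged sentences of telugu.pos (available as nltk.corpus.indian.tagged_sents('telugu.pos') after nltk.download('indian')). A minimal sketch; the two hand-made training sentences below are purely illustrative placeholders for the real corpus:

```python
import nltk

# Placeholder data; in practice use
# nltk.corpus.indian.tagged_sents('telugu.pos') instead.
train_sents = [
    [('nenu', 'PRP'), ('pustakam', 'NN'), ('chaduvutunnanu', 'VB')],
    [('ame', 'PRP'), ('pustakam', 'NN'), ('istundi', 'VB')],
]

# A unigram tagger learns one tag per word; words it has never seen
# get the tag None, so in practice add a backoff tagger.
tagger = nltk.UnigramTagger(train_sents)
print(tagger.tag(['nenu', 'pustakam']))
# [('nenu', 'PRP'), ('pustakam', 'NN')]
```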

What does NN VBD IN DT NNS RB mean in NLTK?

when I chunk text, I get lots of codes in the output like
NN, VBD, IN, DT, NNS, RB.
Is there a list documented somewhere which tells me the meaning of these?
I have tried googling "nltk chunk code", "nltk chunk grammar" and "nltk chunk tokens", but I am not able to find any documentation which explains what these codes mean.
The tags that you see are not a result of the chunks but the POS tagging that happens before chunking. It's the Penn Treebank tagset, see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> sent = "This is a Foo Bar sentence."
# POS tag.
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('Foo', 'NNP'), ('Bar', 'NNP'), ('sentence', 'NN'), ('.', '.')]
>>> tagged_sent = pos_tag(word_tokenize(sent))
# Chunk.
>>> ne_chunk(tagged_sent)
Tree('S', [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]), ('sentence', 'NN'), ('.', '.')])
To get the chunks look for subtrees within the chunked outputs. From the above output, the Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]) indicates the chunk.
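Extracting the chunks programmatically can be sketched like this, rebuilding the chunked tree by hand so that no tagger or chunker models are needed:

```python
from nltk import Tree

# The chunked output from above, built by hand.
chunked = Tree('S', [
    ('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
    Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]),
    ('sentence', 'NN'), ('.', '.'),
])

# subtrees() also yields the root, so filter out the 'S' node.
for chunk in chunked.subtrees(filter=lambda t: t.label() != 'S'):
    print(chunk.label(), [word for word, tag in chunk.leaves()])
# ORGANIZATION ['Foo', 'Bar']
```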
This tutorial site is pretty helpful to explain the chunking process in NLTK: http://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/dongqing-chunking.pdf.
For official documentation, see http://www.nltk.org/howto/chunk.html
The links above already cover all of these, but hopefully this is still helpful for someone; I've added a few that are missing from the other links.
CC: Coordinating conjunction
CD: Cardinal number
DT: Determiner
EX: Existential there
FW: Foreign word
IN: Preposition or subordinating conjunction
JJ: Adjective
VP: Verb Phrase
JJR: Adjective, comparative
JJS: Adjective, superlative
LS: List item marker
MD: Modal
NN: Noun, singular or mass
NNS: Noun, plural
PP: Prepositional phrase
NNP: Proper noun, singular
NNPS: Proper noun, plural
PDT: Predeterminer
POS: Possessive ending
PRP: Personal pronoun
PRP$: Possessive pronoun
RB: Adverb
RBR: Adverb, comparative
RBS: Adverb, superlative
RP: Particle
S: Simple declarative clause
SBAR: Clause introduced by a (possibly empty) subordinating conjunction
SBARQ: Direct question introduced by a wh-word or a wh-phrase.
SINV: Inverted declarative sentence, i.e. one in which the subject follows the tensed verb or modal.
SQ: Inverted yes/no question, or main clause of a wh-question, following the wh-phrase in SBARQ.
SYM: Symbol
VBD: Verb, past tense
VBG: Verb, gerund or present participle
VBN: Verb, past participle
VBP: Verb, non-3rd person singular present
VBZ: Verb, 3rd person singular present
WDT: Wh-determiner
WP: Wh-pronoun
WP$: Possessive wh-pronoun
WRB: Wh-adverb
As Alvas said above, these tags are part-of-speech (and phrase-level) labels, which tell whether a word or phrase is a noun, adverb, determiner, verb, and so on.
Here are the POS Tag details you can refer.
Chunking recovers the phrases from the part-of-speech tags.
You can refer to this link to read more about chunking.

custom tagging with nltk

I'm trying to create a small English-like language for specifying tasks. The basic idea is to split a statement into verbs and the noun phrases those verbs should apply to. I'm working with nltk but not getting the results I'd hoped for, e.g.:
>>> nltk.pos_tag(nltk.word_tokenize("select the files and copy to harddrive'"))
[('select', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('and', 'CC'), ('copy', 'VB'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("move the files to harddrive'"))
[('move', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("copy the files to harddrive'"))
[('copy', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
In each case it has failed to realise that the first word (select, move and copy) was intended as a verb. I know I can create custom taggers and grammars to work around this, but at the same time I'm hesitant to reinvent the wheel when a lot of this stuff is out of my league. I would particularly prefer a solution that could handle non-English languages as well.
So anyway, my question is one of:
Is there a better tagger for this type of grammar?
Is there a way to weight an existing tagger towards using the verb form more frequently than the noun form?
Is there a way to train a tagger?
Is there a better way altogether?
One solution is to create a manual UnigramTagger that backs off to the NLTK tagger. Something like this:
>>> import nltk.tag, nltk.data
>>> default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)
Then you get
>>> tagger.tag(['select', 'the', 'files'])
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')]
This same method can work for non-English languages, as long as you have an appropriate default tagger. You can train your own taggers using train_tagger.py from nltk-trainer and an appropriate corpus.
Jacob's answer is spot on. However, to expand upon it, you may find you need more than just unigrams.
For example, consider the three sentences:
select the files
use the select function on the sockets
the select was good
Here, the word "select" is being used as a verb, adjective, and noun respectively. A unigram tagger won't be able to model this. Even a bigram tagger can't handle it, because two of the cases share the same preceding word (i.e. "the"). You'd need a trigram tagger to handle this case correctly.
import nltk.tag, nltk.data
from nltk import word_tokenize

default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)

def evaluate(tagger, sentences):
    good, total = 0, 0.0
    for sentence, func in sentences:
        tags = tagger.tag(word_tokenize(sentence))
        print(tags)
        good += func(tags)
        total += 1
    print('Accuracy:', good / total)

sentences = [
    ('select the files', lambda tags: ('select', 'VB') in tags),
    ('use the select function on the sockets', lambda tags: ('select', 'JJ') in tags and ('use', 'VB') in tags),
    ('the select was good', lambda tags: ('select', 'NN') in tags),
]

train_sents = [
    [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
    [('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN'), ('on', 'IN'), ('the', 'DT'), ('sockets', 'NNS')],
    [('the', 'DT'), ('select', 'NN'), ('files', 'NNS')],
]

tagger = nltk.TrigramTagger(train_sents, backoff=default_tagger)
evaluate(tagger, sentences)
#model = tagger._context_to_tag
Note, you can use NLTK's NgramTagger to train a tagger using an arbitrarily high number of n-grams, but typically you don't get much performance increase after trigrams.
See Jacob's answer.
In later versions (at least nltk 3.2) nltk.tag._POS_TAGGER does not exist. The default taggers are usually downloaded into the nltk_data/taggers/ directory, e.g.:
>>> import nltk
>>> nltk.download('maxent_treebank_pos_tagger')
Usage is as follows.
>>> import nltk.tag, nltk.data
>>> tagger_path = '/path/to/nltk_data/taggers/maxent_treebank_pos_tagger/english.pickle'
>>> default_tagger = nltk.data.load(tagger_path)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)
See also: How to do POS tagging using the NLTK POS tagger in Python.
Bud's answer is correct. Also, according to this link,
if your nltk_data packages were correctly installed, then NLTK knows where they are on your system, and you don't need to pass an absolute path.
Meaning, you can just say
tagger_path = 'taggers/maxent_treebank_pos_tagger/english.pickle'
default_tagger = nltk.data.load(tagger_path)

Reading and writing POS tagged sentences from text files using NLTK and Python

Does anyone know if there is an existing module or easy method for reading and writing part-of-speech tagged sentences to and from text files? I'm using python and the Natural Language Toolkit (NLTK). For example, this code:
import nltk
sentences = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."
tagged = nltk.sent_tokenize(sentences.strip())
tagged = [nltk.word_tokenize(sent) for sent in tagged]
tagged = [nltk.pos_tag(sent) for sent in tagged]
print(tagged)
Returns this nested list:
[[('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP'), ('.', '.')], [('Some', 'DT'), ('years', 'NNS'), ('ago', 'RB'), ('-', ':'), ('never', 'RB'), ('mind', 'VBP'), ('how', 'WRB'), ('long', 'JJ'), ('precisely', 'RB'), ('-', ':'), ('having', 'VBG'), ('little', 'RB'), ('or', 'CC'), ('no', 'DT'), ('money', 'NN'), ('in', 'IN'), ('my', 'PRP$'), ('purse', 'NN'), (',', ','), ('and', 'CC'), ('nothing', 'NN'), ('particular', 'JJ'), ('to', 'TO'), ('interest', 'NN'), ('me', 'PRP'), ('on', 'IN'), ('shore', 'NN'), (',', ','), ('I', 'PRP'), ('thought', 'VBD'), ('I', 'PRP'), ('would', 'MD'), ('sail', 'VB'), ('about', 'IN'), ('a', 'DT'), ('little', 'RB'), ('and', 'CC'), ('see', 'VB'), ('the', 'DT'), ('watery', 'NN'), ('part', 'NN'), ('of', 'IN'), ('the', 'DT'), ('world', 'NN'), ('.', '.')]]
I know I could easily dump this into a pickle, but I really want to export this as a segment of a larger text file. I'd like to be able to export the list to a text file, and then return to it later, parse it, and recover the original list structure. Are there any built in functions in the NLTK for doing this? I've looked but can't find any...
Example output:
<headline>Article headline</headline>
<body>Call me Ishmael...</body>
<pos_tags>[[('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP')...</pos_tags>
The NLTK has a standard file format for tagged text. It looks like this:
Call/NNP me/PRP Ishmael/NNP ./.
You should use this format, since it allows you to read your files with the NLTK's TaggedCorpusReader and other similar classes, and get the full range of corpus reader functions. Confusingly, there is no high-level function in the NLTK for writing a tagged corpus in this format, but that's probably because it's pretty trivial:
for sent in tagged:
    print(" ".join(word + "/" + tag for word, tag in sent))
(The NLTK does provide nltk.tag.tuple2str(), but it only handles one word -- it's simpler to just write word + "/" + tag.)
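The counterpart for reading is nltk.tag.str2tuple(), so a round trip through this format can be sketched as:

```python
import nltk

sent = [('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP'), ('.', '.')]

# Write: one word/tag token per pair, space-separated.
line = " ".join(nltk.tag.tuple2str(pair) for pair in sent)
print(line)              # Call/NNP me/PRP Ishmael/NNP ./.

# Read: split on whitespace and convert each token back to a tuple.
restored = [nltk.tag.str2tuple(tok) for tok in line.split()]
print(restored == sent)  # True
```

Note that str2tuple() uppercases the tag, which is harmless here since Penn Treebank tags are already uppercase.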
If you save your tagged text in one or more files fileN.txt in this format, you can read it back with nltk.corpus.reader.TaggedCorpusReader like this:
mycorpus = nltk.corpus.reader.TaggedCorpusReader("path/to/corpus", r"file.*\.txt")
print(mycorpus.fileids())
print(mycorpus.sents()[0])
for sent in mycorpus.tagged_sents():
    <etc>
Note that the sents() method gives you the untagged text, albeit a bit oddly spaced. There's no need to include both tagged and untagged versions in the file, as in your example.
The TaggedCorpusReader doesn't support file headers (for the title etc.), but if you really need that you can derive your own class that reads the file metadata and then handles the rest like TaggedCorpusReader.
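Putting both halves together, a write-then-read round trip can be sketched like this (the directory and file name are arbitrary):

```python
import os
import tempfile
from nltk.corpus.reader import TaggedCorpusReader

tagged = [[('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP'), ('.', '.')]]

# Write one sentence per line in the word/tag format.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, 'moby.txt'), 'w') as f:
    for sent in tagged:
        f.write(" ".join(word + "/" + tag for word, tag in sent) + "\n")

# Read it back; the reader's defaults (sep='/', one sentence per
# line) match what we wrote, so no extra arguments are needed.
reader = TaggedCorpusReader(corpus_dir, r'.*\.txt')
print(reader.tagged_sents()[0])
# [('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP'), ('.', '.')]
```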
It seems like using pickle.dumps and inserting its output into your text file, perhaps with a tag wrapper for automated loading, would satisfy your requirements.
Can you be more specific about what you would like the text output to look like?
Are you aiming for something that is more human-readable?
EDIT: adding some code
from xml.dom.minidom import Document, parseString
import nltk
sentences = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."
tagged = nltk.sent_tokenize(sentences.strip())
tagged = [nltk.word_tokenize(sent) for sent in tagged]
tagged = [nltk.pos_tag(sent) for sent in tagged]
# Write to xml string
doc = Document()
base = doc.createElement("Document")
doc.appendChild(base)
headline = doc.createElement("headline")
htext = doc.createTextNode("Article Headline")
headline.appendChild(htext)
base.appendChild(headline)
body = doc.createElement("body")
btext = doc.createTextNode(sentences)
body.appendChild(btext)
base.appendChild(body)
pos_tags = doc.createElement("pos_tags")
tagtext = doc.createTextNode(repr(tagged))
pos_tags.appendChild(tagtext)
base.appendChild(pos_tags)
xmlstring = doc.toxml()
# Read back tagged
import ast
doc2 = parseString(xmlstring)
el = doc2.getElementsByTagName("pos_tags")[0]
text = el.firstChild.nodeValue
tagged2 = ast.literal_eval(text)  # safer than eval for parsing the repr
print("Equal?", tagged == tagged2)
