I'm trying to create a small English-like language for specifying tasks. The basic idea is to split a statement into verbs and noun phrases that those verbs should apply to. I'm working with nltk but not getting the results I'd hoped for, e.g.:
>>> nltk.pos_tag(nltk.word_tokenize("select the files and copy to harddrive'"))
[('select', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('and', 'CC'), ('copy', 'VB'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("move the files to harddrive'"))
[('move', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
>>> nltk.pos_tag(nltk.word_tokenize("copy the files to harddrive'"))
[('copy', 'NN'), ('the', 'DT'), ('files', 'NNS'), ('to', 'TO'), ("harddrive'", 'NNP')]
In each case it has failed to recognise that the first word (select, move, and copy) was intended as a verb. I know I can create custom taggers and grammars to work around this, but at the same time I'm hesitant to go reinventing the wheel when a lot of this stuff is out of my league. I would particularly prefer a solution where non-English languages could be handled as well.
So anyway, my question is one of:
Is there a better tagger for this type of grammar?
Is there a way to weight an existing tagger towards using the verb form more frequently than the noun form?
Is there a way to train a tagger?
Is there a better way altogether?
One solution is to create a manual UnigramTagger that backs off to the NLTK tagger. Something like this:
>>> import nltk.tag, nltk.data
>>> default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)
Then you get
>>> tagger.tag(['select', 'the', 'files'])
[('select', 'VB'), ('the', 'DT'), ('files', 'NNS')]
This same method can work for non-English languages, as long as you have an appropriate default tagger. You can train your own taggers using train_tagger.py from nltk-trainer and an appropriate corpus.
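For instance, here is a minimal sketch of training a unigram tagger on NLTK's Brown corpus directly (this assumes the brown corpus has been downloaded; nltk-trainer automates essentially this workflow):

import nltk
from nltk.corpus import brown

# nltk.download('brown')  # tagged corpus to train on
tagged_sents = brown.tagged_sents(categories='news')
train_sents, test_sents = tagged_sents[:4000], tagged_sents[4000:]

# Train on most of the corpus, back off to NN for unseen words,
# and check accuracy on the held-out rest
tagger = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
print(tagger.evaluate(test_sents))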
Jacob's answer is spot on. However, to expand upon it, you may find you need more than just unigrams.
For example, consider the three sentences:
select the files
use the select function on the sockets
the select was good
Here, the word "select" is being used as a verb, adjective, and noun respectively. A unigram tagger won't be able to model this. Even a bigram tagger can't handle it, because two of the cases share the same preceding word (i.e. "the"). You'd need a trigram tagger to handle this case correctly.
import nltk.tag, nltk.data
from nltk import word_tokenize

default_tagger = nltk.data.load(nltk.tag._POS_TAGGER)

def evaluate(tagger, sentences):
    good, total = 0, 0.0
    for sentence, func in sentences:
        tags = tagger.tag(word_tokenize(sentence))
        print tags
        good += func(tags)
        total += 1
    print 'Accuracy:', good / total

sentences = [
    ('select the files', lambda tags: ('select', 'VB') in tags),
    ('use the select function on the sockets', lambda tags: ('select', 'JJ') in tags and ('use', 'VB') in tags),
    ('the select was good', lambda tags: ('select', 'NN') in tags),
]

train_sents = [
    [('select', 'VB'), ('the', 'DT'), ('files', 'NNS')],
    [('use', 'VB'), ('the', 'DT'), ('select', 'JJ'), ('function', 'NN'), ('on', 'IN'), ('the', 'DT'), ('sockets', 'NNS')],
    [('the', 'DT'), ('select', 'NN'), ('was', 'VBD'), ('good', 'JJ')],
]

tagger = nltk.TrigramTagger(train_sents, backoff=default_tagger)
evaluate(tagger, sentences)
#model = tagger._context_to_tag
Note that you can use NLTK's NgramTagger to train a tagger using an arbitrarily high n-gram order, but typically you don't get much performance increase beyond trigrams.
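For instance, a minimal sketch of one more backoff level, reusing train_sents and the trigram tagger built in the code above:

# A 4-gram tagger that backs off to the trigram tagger, which in turn
# backs off to the default tagger
quadgram_tagger = nltk.NgramTagger(4, train_sents, backoff=tagger)
print(quadgram_tagger.tag(nltk.word_tokenize('select the files')))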
See Jacob's answer.
In later versions (at least NLTK 3.2), nltk.tag._POS_TAGGER no longer exists. The default taggers are usually downloaded into the nltk_data/taggers/ directory, e.g.:
>>> import nltk
>>> nltk.download('maxent_treebank_pos_tagger')
Usage is as follows.
>>> import nltk.tag, nltk.data
>>> tagger_path = '/path/to/nltk_data/taggers/maxent_treebank_pos_tagger/english.pickle'
>>> default_tagger = nltk.data.load(tagger_path)
>>> model = {'select': 'VB'}
>>> tagger = nltk.tag.UnigramTagger(model=model, backoff=default_tagger)
See also: How to do POS tagging using the NLTK POS tagger in Python.
Bud's answer is correct. Also, according to this link,
if your nltk_data packages were correctly installed, then NLTK knows where they are on your system, and you don't need to pass an absolute path.
Meaning, you can just say
tagger_path = 'taggers/maxent_treebank_pos_tagger/english.pickle'
default_tagger = nltk.data.load(tagger_path)
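In still newer NLTK releases the shipped default model is the averaged perceptron rather than the maxent treebank tagger. Here is a minimal sketch of the same backoff idea against that model (assuming the model has been downloaded):

import nltk
from nltk.tag import UnigramTagger
from nltk.tag.perceptron import PerceptronTagger

# nltk.download('averaged_perceptron_tagger')  # the current default model
default_tagger = PerceptronTagger()
tagger = UnigramTagger(model={'select': 'VB'}, backoff=default_tagger)

# 'select' comes from the unigram model; everything else falls through
# to the perceptron
print(tagger.tag(['select', 'the', 'files']))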
I'm a total noob, so sorry if I'm asking something obvious. My question is twofold, or rather it's two questions on the same topic:
I'm studying nltk at uni, and we're doing chunking. In my notes I have the following grammar:
grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN.*>+} # noun phrase
PP: {<IN><NP>} # prepositional phrase
VP: {<MD>?<VB.*><NP|PP>} # verb phrase
CLAUSE: {<NP><VP>} # full clause
"""
What is the "$" symbol for in this case? I know it's "end of the line" in regex, but what does it stand for here?
Also, in my text book there's a Tree that's been printed without using the .draw() function, to this result:
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
How the heck does one do that???
Thanks in advance to anybody who'll have the patience to school this noob :D
This is the code for your example:
import nltk
sentence = [('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'), ('many', 'JJ'), ('chapters', 'NNS')]
grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN.*>+} # noun phrase
PP: {<IN><NP>} # prepositional phrase
VP: {<MD>?<VB.*><NP|PP>} # verb phrase
CLAUSE: {<NP><VP>} # full clause
"""
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
# Output:
# (S
#   (CLAUSE (NP the/DT book/NN) (VP has/VBZ (NP many/JJ chapters/NNS))))
result.draw()
As for your second question: evaluating result in the interactive shell (or calling repr(result)) displays the tree in the constructor form from your textbook:
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
I found this link where you can learn a lot.
The $ symbol is a special character in regular expressions, and must be backslash escaped in order to match the tag PP$.
An unescaped $ keeps its usual regex meaning of "end of string". For example, the pattern xyz$ matches "xyz" only at the end of a string.
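To see the difference concretely, here is a minimal sketch using the Penn Treebank tag PRP$ (what pos_tag assigns to possessive pronouns) instead of the book's older PP$ tag:

import nltk

# "my" carries the tag PRP$; the $ is part of the tag name itself
sentence = [('my', 'PRP$'), ('book', 'NN')]

# The backslash makes $ a literal character, so <PRP\$> matches the tag "PRP$"
cp = nltk.RegexpParser(r"NP: {<PRP\$><NN>}")
print(cp.parse(sentence))
# (S (NP my/PRP$ book/NN))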
I am new to python and I'm giving nltk a shot. I came across the following:
namedEnt = nltk.ne_chunk(tagged)
Where tagged is
tagged = nltk.pos_tag(words)
and words are the tokens of a sentence.
I would like to remove the stop words from namedEnt. I was able to first remove the stop words from the tokens and then chunk, but I was not able to chunk and then remove the stop words. Is it possible? If so, how could I do this?
E.g. the sentence "get me todays menu." is tagged as
('get', 'VB')
('me', 'PRP')
('todays', 'JJ')
('menu', 'NN')
('.', '.')
and I would like to get
[('get', 'VB'), ('todays', 'JJ'), ('menu', 'NN'), ('.', '.')]
Not sure if I get what you want (if you feed filtered from the example below to nltk.ne_chunk, it probably produces incorrect output, because ne_chunk expects a grammatical sentence, I guess), but your desired input/output can be achieved in the following way:
>>> import nltk
>>> words = ['Get', 'me', 'todays', 'menu', '.']
>>> tagged = nltk.pos_tag(words)
>>> tagged
[('Get', 'NN'), ('me', 'PRP'), ('todays', 'VBZ'), ('menu', 'NN'), ('.', '.')]
>>> stopwords = set(nltk.corpus.stopwords.words('english'))
>>> filtered = [t for t in tagged if t[0] not in stopwords]
>>> filtered
[('Get', 'NN'), ('todays', 'VBZ'), ('menu', 'NN'), ('.', '.')]
Again, not sure how useful this actually is. namedEnt in your example returns a tree structure, so you'd have to traverse this tree to extract whatever parts you are interested in, which probably means skipping over stopwords anyway...
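If you do want to chunk first and filter afterwards, one option is to walk the tree that ne_chunk returns, keeping named-entity subtrees intact and dropping stopword leaves. A minimal sketch (the helper name is just illustrative):

import nltk

stopwords = set(nltk.corpus.stopwords.words('english'))

def drop_stopwords(ne_tree):
    # Children of an ne_chunk tree are either Tree objects (named
    # entities) or plain (word, tag) tuples; keep the former and
    # filter the latter against the stopword list
    kept = []
    for child in ne_tree:
        if isinstance(child, nltk.Tree):
            kept.append(child)
        elif child[0].lower() not in stopwords:
            kept.append(child)
    return kept

Applied to your example, this drops ('me', 'PRP') while leaving any named-entity chunks untouched.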
I have been doing English POS tagging for a long time. It's straightforward:
>>> text = word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
But I want to do it in Telugu.
I followed this article - http://jaganadhg.freeflux.net/blog/archive/2009/10/12/nltk-and-indian-language-corpus-processing-part-ii.html
I could test a few of the built-in sentences. But I could not figure out how to tag arbitrary Telugu text. Could someone with experience using NLTK for non-English text please offer guidance?
For reference, my telugu.pos file contains 9999 words and 1197 sentences in total.
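For what it's worth, here is a minimal sketch of roughly what that article describes: train a simple tagger on NLTK's built-in indian corpus and apply it to raw text (this assumes the indian corpus has been downloaded, and a unigram tagger is only a crude baseline):

import nltk
from nltk.corpus import indian

# nltk.download('indian')  # ships the telugu.pos tagged corpus
train_sents = indian.tagged_sents('telugu.pos')

# Learn the most frequent tag for each word form in the training data
tagger = nltk.UnigramTagger(train_sents)

# Any raw Telugu string can then be tokenized and tagged; words never
# seen in the small training corpus come back tagged as None
text = u"..."  # put any Telugu sentence here
print(tagger.tag(nltk.word_tokenize(text)))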
I want to perform part-of-speech tagging and entity recognition in Python, similar to the Maxent_POS_Tag_Annotator and Maxent_Entity_Annotator functions of openNLP in R. I would prefer code in Python which takes a textual sentence as input and gives different features as output, like the number of "CC", number of "CD", number of "DT", etc. CC, CD, and DT are POS tags as used in the Penn Treebank, so there should be 36 columns/features for POS tagging, corresponding to the 36 Penn Treebank POS tags. I want to implement this in the Azure ML "Execute Python Script" module, and Azure ML supports Python 2.7.7. I heard NLTK in Python may do the job, but I am a beginner in Python. Any help would be appreciated.
Take a look at the NLTK book, Categorizing and Tagging Words section.
A simple example, using the Penn Treebank tagset:
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'),
("'s", 'POS'),
('big', 'JJ'),
('idea', 'NN'),
('is', 'VBZ'),
("n't", 'RB'),
('all', 'DT'),
('that', 'DT'),
('bad', 'JJ'),
('.', '.')]
Then you can use
from collections import defaultdict
counts = defaultdict(int)
for (word, tag) in pos_tag(word_tokenize("John's big idea isn't all that bad.")):
    counts[tag] += 1
to get frequencies:
defaultdict(<type 'int'>, {'JJ': 2, 'NN': 1, 'POS': 1, '.': 1, 'RB': 1, 'VBZ': 1, 'DT': 2, 'NNP': 1})
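To turn those counts into the fixed 36-column layout you describe, you can index them against the full Penn Treebank tagset. A minimal sketch (the tag list is the standard 36-tag set, written out by hand here, and the function name is just illustrative):

from collections import defaultdict
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize

# The 36 Penn Treebank POS tags, in a fixed column order
PENN_TAGS = ['CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS',
             'MD', 'NN', 'NNS', 'NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$',
             'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD',
             'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']

def pos_feature_vector(sentence):
    # Count tags, then read the counts out in the fixed column order;
    # punctuation tags like '.' fall outside the 36 columns and are ignored
    counts = defaultdict(int)
    for word, tag in pos_tag(word_tokenize(sentence)):
        counts[tag] += 1
    return [counts[tag] for tag in PENN_TAGS]

print(pos_feature_vector("John's big idea isn't all that bad."))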
Does anyone know if there is an existing module or easy method for reading and writing part-of-speech tagged sentences to and from text files? I'm using python and the Natural Language Toolkit (NLTK). For example, this code:
import nltk
sentences = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."
tagged = nltk.sent_tokenize(sentences.strip())
tagged = [nltk.word_tokenize(sent) for sent in tagged]
tagged = [nltk.pos_tag(sent) for sent in tagged]
print tagged
Returns this nested list:
[[('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP'), ('.', '.')], [('Some', 'DT'), ('years', 'NNS'), ('ago', 'RB'), ('-', ':'), ('never', 'RB'), ('mind', 'VBP'), ('how', 'WRB'), ('long', 'JJ'), ('precisely', 'RB'), ('-', ':'), ('having', 'VBG'), ('little', 'RB'), ('or', 'CC'), ('no', 'DT'), ('money', 'NN'), ('in', 'IN'), ('my', 'PRP$'), ('purse', 'NN'), (',', ','), ('and', 'CC'), ('nothing', 'NN'), ('particular', 'JJ'), ('to', 'TO'), ('interest', 'NN'), ('me', 'PRP'), ('on', 'IN'), ('shore', 'NN'), (',', ','), ('I', 'PRP'), ('thought', 'VBD'), ('I', 'PRP'), ('would', 'MD'), ('sail', 'VB'), ('about', 'IN'), ('a', 'DT'), ('little', 'RB'), ('and', 'CC'), ('see', 'VB'), ('the', 'DT'), ('watery', 'NN'), ('part', 'NN'), ('of', 'IN'), ('the', 'DT'), ('world', 'NN'), ('.', '.')]]
I know I could easily dump this into a pickle, but I really want to export this as a segment of a larger text file. I'd like to be able to export the list to a text file, and then return to it later, parse it, and recover the original list structure. Are there any built in functions in the NLTK for doing this? I've looked but can't find any...
Example output:
<headline>Article headline</headline>
<body>Call me Ishmael...</body>
<pos_tags>[[('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP')...</pos_tags>
The NLTK has a standard file format for tagged text. It looks like this:
Call/NNP me/PRP Ishmael/NNP ./.
You should use this format, since it allows you to read your files with the NLTK's TaggedCorpusReader and other similar classes, and get the full range of corpus reader functions. Confusingly, there is no high-level function in the NLTK for writing a tagged corpus in this format, but that's probably because it's pretty trivial:
for sent in tagged:
print " ".join(word+"/"+tag for word, tag in sent)
(The NLTK does provide nltk.tag.tuple2str(), but it only handles one word; it's simpler to just type word+"/"+tag.)
If you save your tagged text in one or more files fileN.txt in this format, you can read it back with nltk.corpus.reader.TaggedCorpusReader like this:
mycorpus = nltk.corpus.reader.TaggedCorpusReader("path/to/corpus", "file.*\.txt")
print mycorpus.fileids()
print mycorpus.sents()[0]
for sent in mycorpus.tagged_sents():
    <etc>
Note that the sents() method gives you the untagged text, albeit a bit oddly spaced. There's no need to include both tagged and untagged versions in the file, as in your example.
The TaggedCorpusReader doesn't support file headers (for the title etc.), but if you really need that you can derive your own class that reads the file metadata and then handles the rest like TaggedCorpusReader.
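If you don't need the corpus-reader machinery, a plain round trip is also easy to write by hand. This sketch uses nltk.tag.str2tuple, which undoes the word+"/"+tag format above (the filename is just an example):

import nltk

tagged = [[('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP'), ('.', '.')]]

# Write: one sentence per line, space-separated word/tag tokens
with open('tagged.txt', 'w') as f:
    for sent in tagged:
        f.write(" ".join(word + "/" + tag for word, tag in sent) + "\n")

# Read: split on whitespace and convert each token back to a tuple
with open('tagged.txt') as f:
    restored = [[nltk.tag.str2tuple(tok) for tok in line.split()] for line in f]

print(restored == tagged)  # True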
It seems like using pickle.dumps and inserting its output into your text file, perhaps with a tag wrapper for automated loading, would satisfy your requirements.
Can you be more specific about what you would like the text output to look like?
Are you aiming for something that is more human-readable?
EDIT: adding some code
import ast
from xml.dom.minidom import Document, parseString
import nltk
sentences = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."
tagged = nltk.sent_tokenize(sentences.strip())
tagged = [nltk.word_tokenize(sent) for sent in tagged]
tagged = [nltk.pos_tag(sent) for sent in tagged]
# Write to xml string
doc = Document()
base = doc.createElement("Document")
doc.appendChild(base)
headline = doc.createElement("headline")
htext = doc.createTextNode("Article Headline")
headline.appendChild(htext)
base.appendChild(headline)
body = doc.createElement("body")
btext = doc.createTextNode(sentences)
body.appendChild(btext)
base.appendChild(body)
pos_tags = doc.createElement("pos_tags")
tagtext = doc.createTextNode(repr(tagged))
pos_tags.appendChild(tagtext)
base.appendChild(pos_tags)
xmlstring = doc.toxml()
# Read back tagged
doc2 = parseString(xmlstring)
el = doc2.getElementsByTagName("pos_tags")[0]
text = el.firstChild.nodeValue
tagged2 = ast.literal_eval(text)  # safer than eval for parsing the repr back into lists/tuples
print "Equal? ", tagged == tagged2