NLTK: From string to Tree with "slash-tokens" word/POS? - python

The pretty-print of the nltk.Tree class prints in the following format:
print spacy2tree(nlp(u'Williams is a defensive coach'))
(S
  (SUBJ Williams/NNP)
  (PRED is/VBZ test/VBN)
  a/DT
  defensive/JJ
  coach/NN)
As a Tree:
spacy2tree(nlp(u'Williams is a defensive coach'))
Tree('S', [Tree('SUBJ', [(u'Williams', u'NNP')]),
     Tree('PRED', [(u'is', u'VBZ'), ('test', 'VBN')]), (u'a', u'DT'), (u'defensive', u'JJ'), (u'coach', u'NN')])
but Tree.fromstring() doesn't ingest it correctly:
tfs = spacy2tree(nlp(u'Williams is a defensive coach')).pformat()
Tree.fromstring(tfs)
Tree('S', [Tree('SUBJ', ['Williams/NNP']),
     Tree('PRED', ['is/VBZ', 'test/VBN']), 'a/DT', 'defensive/JJ', 'coach/NN'])
For example (correct =vs=> incorrect):
('SUBJ', [(u'Williams', u'NNP')])            =vs=> ('SUBJ', ['Williams/NNP'])
('PRED', [(u'is', u'VBZ'), ('test', 'VBN')]) =vs=> ('PRED', ['is/VBZ', 'test/VBN'])
Is there a utility to ingest a Tree from a string correctly?

It seems I figured it out:
Tree.fromstring(tfs, read_leaf=lambda s: tuple(s.split('/')))
Tree('S', [Tree('SUBJ', [(u'Williams', u'NNP')]),
     Tree('PRED', [(u'is', u'VBZ'), (u'test', u'VBN')]), (u'a', u'DT'), (u'defensive', u'JJ'), (u'coach', u'NN')])
So now this works correctly too:
tree2conlltags(Tree.fromstring(tfs, read_leaf=lambda s: tuple(s.split('/'))))
[(u'Williams', u'NNP', u'B-SUBJ'),
(u'is', u'VBZ', u'B-PRED'),
(u'test', u'VBN', u'I-PRED'),
(u'a', u'DT', u'O'),
(u'defensive', u'JJ', u'O'),
(u'coach', u'NN', u'O')]


An error in implementing regex function on a list

I was trying to apply a regex to a list of grammar tags in Python, to find the tense of the sentence from its tags. I wrote the following code to implement it.
Data preprocessing:
from nltk import word_tokenize, pos_tag
import nltk
text = "He will have been doing his homework."
tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)
tags = []
for i in range(len(tagged)):
    t = tagged[i]
    tags.append(t[1])
print(tags)
The regex grammar to be applied:
grammar = r"""
Future_Perfect_Continuous: {<MD><VB><VBN><VBG>}
Future_Continuous: {<MD><VB><VBG>}
Future_Perfect: {<MD><VB><VBN>}
Past_Perfect_Continuous: {<VBD><VBN><VBG>}
Present_Perfect_Continuous:{<VBP|VBZ><VBN><VBG>}
Future_Indefinite: {<MD><VB>}
Past_Continuous: {<VBD><VBG>}
Past_Perfect: {<VBD><VBN>}
Present_Continuous: {<VBZ|VBP><VBG>}
Present_Perfect: {<VBZ|VBP><VBN>}
Past_Indefinite: {<VBD>}
Present_Indefinite: {<VBZ>|<VBP>}
Function to apply the grammar to the tags list:
def check_grammar(grammar, tags):
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(tags)
    print(result)
    result.draw()

check_grammar(grammar, tags)
But it raised the following error:
Traceback (most recent call last):
  File "/home/samar/Desktop/twitter_tense/main.py", line 35, in <module>
    check_grammar(grammar, tags)
  File "/home/samar/Desktop/twitter_tense/main.py", line 31, in check_grammar
    result = cp.parse(tags)
  File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 1276, in parse
    chunk_struct = parser.parse(chunk_struct, trace=trace)
  File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 1083, in parse
    chunkstr = ChunkString(chunk_struct)
  File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 95, in __init__
    tags = [self._tag(tok) for tok in self._pieces]
  File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 95, in <listcomp>
    tags = [self._tag(tok) for tok in self._pieces]
  File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 105, in _tag
    raise ValueError("chunk structures must contain tagged " "tokens or trees")
ValueError: chunk structures must contain tagged tokens or trees
Your call to cp.parse() expects each of the tokens in your sentence to be tagged; however, the tags list you created contains only the tags, not the tokens as well, hence your ValueError. The solution is to instead pass the output of the pos_tag() call (i.e. tagged) to your check_grammar call (see below).
Solution
from nltk import word_tokenize, pos_tag
import nltk
text = "He will have been doing his homework."
tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)
print(tagged)
# Output
>>> [('He', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('been', 'VBN'), ('doing', 'VBG'), ('his', 'PRP$'), ('homework', 'NN'), ('.', '.')]
my_grammar = r"""
Future_Perfect_Continuous: {<MD><VB><VBN><VBG>}
Future_Continuous: {<MD><VB><VBG>}
Future_Perfect: {<MD><VB><VBN>}
Past_Perfect_Continuous: {<VBD><VBN><VBG>}
Present_Perfect_Continuous:{<VBP|VBZ><VBN><VBG>}
Future_Indefinite: {<MD><VB>}
Past_Continuous: {<VBD><VBG>}
Past_Perfect: {<VBD><VBN>}
Present_Continuous: {<VBZ|VBP><VBG>}
Present_Perfect: {<VBZ|VBP><VBN>}
Past_Indefinite: {<VBD>}
Present_Indefinite: {<VBZ>|<VBP>}"""
def check_grammar(grammar, tags):
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(tags)
    print(result)
    result.draw()

check_grammar(my_grammar, tagged)
Output:
>>> (S
>>>   He/PRP
>>>   (Future_Perfect_Continuous will/MD have/VB been/VBN doing/VBG)
>>>   his/PRP$
>>>   homework/NN
>>>   ./.)
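
If you want the detected tense as a string rather than a drawn tree, you can walk the subtrees of the parse result; a minimal sketch (detect_tense is my name for the helper, not part of NLTK):

def detect_tense(tree):
    # cp.parse() returns a Tree rooted at 'S'; every grammar rule that
    # matched appears as a subtree labelled with the rule name
    for subtree in tree.subtrees():
        if subtree.label() != 'S':
            return subtree.label()
    return None

print(detect_tense(nltk.RegexpParser(my_grammar).parse(tagged)))
# -> 'Future_Perfect_Continuous' for the example sentence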

context free grammar tree as a feature

Is there a way to use the output of StanfordParser as a feature for my models?
The output of StanfordParser looks like this:
Sentence: 'hello how are you?'
Result - [Tree('ROOT', [Tree('SBARQ', [Tree('WHADVP', [Tree('RB', ['hello']), Tree('WRB', ['how'])]), Tree('SQ', [Tree('VBP', ['are']), Tree('NP', [Tree('PRP', ['you'])])]), Tree('.', ['?'])])])]
Result[0] - (ROOT (SBARQ
(WHADVP (RB hello) (WRB how))
(SQ (VBP are) (NP (PRP you)))
(. ?)))
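
One common approach, assuming you want flat features for a conventional classifier, is to turn the tree into a bag of its grammar productions; nltk.Tree exposes these via productions(). A sketch using the Result list shown above:

# each production such as 'SQ -> VBP NP' becomes one string feature
features = [str(prod) for prod in Result[0].productions()]

Whether raw production strings make good features depends on your model; they are simply one flat encoding of the tree's structure.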

Cryptography with Python 3.6

I have been working on a cryptography question. The first code I tried to write, using OrderedDict, is this:
from collections import OrderedDict
alphabet = OrderedDict(
[(u'a', u'ç'), (u'b', u'd'), (u'c', u'e'), (u'ç', u'f'), (u'd', u'g'), (u'e', u'g'), (u'f', u'h'),
(u'g', u'i'), (u'g', u'i'), (u'h', u'j'), (u'i', u'k'), (u'i', u'l'), (u'j', u'm'), (u'k', u'n'),
(u'l', u'o'), (u'm', u'ö'), (u'n', u'p'), (u'o', u'r'), (u'ö', u's'), (u'p', u'ş'), (u'r', u't'),
(u's', u'u'), (u'ş', u'ü'), (u't', u'v'), (u'u', u'y'), (u'ü', u'z'), (u'v', u'a'),
(u'y', u'b'), (u'z',u'c'), (9, '')])
text = 'öğtjçdç9dğp9grnvrt9jyuğbkp9içokoğr9pyp9nçtkbğt9iypoğtk9gycğpoğgkikpk9gybgyö9ağ9uğpkp9kekp9drboğ9dkt9urty9\
jçcktoçgkö9içokoğr9gç9uvçmç9dçuoçöçn9kekp9vğn9bçşöçp9iğtğnğp9\
çpçjvçtk9dyogyiyp9ağ9dy9öğvkp9grubçukpk9rnybçdkogkikp9\
çoirtkvöçoçtk9kuvğgkikp9gkogğ9nrgoçöçp9ağ9dçuoçtkpç9çpçjvçt9\
ağ9ngvççmnçç9bçcçtçn9nrgoçtkpk9uktnğvkp9kphr9çgğtğukpğ9öçko9çvöçpgkt9\jçgk9irtğbkö9uğpk'
out = []
for k, v in alphabet.items():
    for i in list(text):
        if i == v:
            out.append(alphabet[v])
outlast = ''.join(out)
print(outlast)
I can't compare the characters of text with the alphabet values. I want to compare every element i of text against the values and, whenever i equals a value, append the corresponding key to the out list. I'm using PyCharm 2017.2.2. Can you help me?
The cryptography key is actually 3: "go up 3 letters in the Turkish alphabet".
This is the wrong output:
ffffffffffffffffffffffffffffffffffffffffffffgggggggggggiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiijllllllllllmmmmmmööpppppppppppprrrrrrrrrrrrrrrrrrsssssssssssşşşşşşşşşşşşşşşşşşşşşşşşşşşştttttttttttttüvvvvvvvvvvvvvvvvvvvvyyyyyyyyyyyyaaaaaaaaaaabbbbbbbbbbbbççççdddddddddeee
Expected Output:
merhaba ben doktor huseyin galileo nun kariyer gunleri duzenledigini duydum ve senin icin boyle bir soru hazirladim galileo da staja baslamak icin tek yapman gereken anahtari buldugun ve bu metin dosyasini okuyabildigin algoritmalari istedigin dilde kodlaman ve baslarina anahtar ve kdtaajkaa yazarak kodlarini sirketin info aderesine mail atmandir hadi goreyim seni
Not sure why you need an OrderedDict; a list of tuples works just as well. Here is how your code might look:
alphabet = [(u'a', u'ç'), (u'b', u'd'), (u'c', u'e'), (u'ç', u'f'), (u'd', u'g'), (u'e', u'g'), (u'f', u'h'),
(u'g', u'i'), (u'g', u'i'), (u'h', u'j'), (u'i', u'k'), (u'i', u'l'), (u'j', u'm'), (u'k', u'n'),
(u'l', u'o'), (u'm', u'ö'), (u'n', u'p'), (u'o', u'r'), (u'ö', u's'), (u'p', u'ş'), (u'r', u't'),
(u's', u'u'), (u'ş', u'ü'), (u't', u'v'), (u'u', u'y'), (u'ü', u'z'), (u'v', u'a'),
(u'y', u'b'), (u'z',u'c'),
(' ', '9')] #'9' should be quoted and be the second element of the tuple
text = 'öğtjçdç9dğp9grnvrt9jyuğbkp9içokoğr9pyp9nçtkbğt9iypoğtk9gycğpoğgkikpk9gybgyö9ağ9uğpkp9kekp9drboğ9dkt9urty9\
jçcktoçgkö9içokoğr9gç9uvçmç9dçuoçöçn9kekp9vğn9bçşöçp9iğtğnğp9\
çpçjvçtk9dyogyiyp9ağ9dy9öğvkp9grubçukpk9rnybçdkogkikp9\
çoirtkvöçoçtk9kuvğgkikp9gkogğ9nrgoçöçp9ağ9dçuoçtkpç9çpçjvçt9\
ağ9ngvççmnçç9bçcçtçn9nrgoçtkpk9uktnğvkp9kphr9çgğtğukpğ9öçko9çvöçpgkt9\jçgk9irtğbkö9uğpk'
out = []
for t in list(text):      # the outer loop is over the text
    for v in alphabet:    # the inner loop searches for the char in the alphabet
        if t == v[1]:
            out += v[0]
outlast = ''.join(out)
print(outlast)
Output:
mrhaba bn deoktor husyin ggalilo nun kariyr ggunlri deuznldeiggini deuydeum v snin icin boyl bir soru hazirladeim ggalilo dea staja baslamak icin tk yapman ggrkn anahtari buldeuggun v bu mtin deosyasini okuyabildeiggin alggoritmalari istdeiggin deilde kodelaman v baslarina anahtar v kdetaajkaa yazarak kodelarini sirktin info adersin mail atmandeir hadei ggoryim sni
There are still some errors in your alphabet definition; I leave fixing them to you.
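
Once the duplicate pairs in alphabet are fixed, a dictionary keyed by the ciphertext character avoids the inner loop entirely. A minimal sketch of that variant (note that with duplicates still present, later pairs silently win in the dict, which is one more reason to fix them first):

# build a reverse lookup: ciphertext char -> plaintext char
decode = {cipher: plain for plain, cipher in alphabet}
outlast = ''.join(decode.get(ch, ch) for ch in text)
print(outlast)

This makes decoding a single pass over text instead of a scan of the whole alphabet for every character.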

Converting Stanford dependencies to numbered format

I am using the Stanford dependency parser, and I get the following output for the sentence
I shot an elephant in my sleep
>>>python dep_parsing.py
[((u'shot', u'VBD'), u'nsubj', (u'I', u'PRP')), ((u'shot', u'VBD'), u'dobj', (u'elephant', u'NN')), ((u'elephant', u'NN'), u'det', (u'an', u'DT')), ((u'shot', u'VBD'), u'nmod', (u'sleep', u'NN')), ((u'sleep', u'NN'), u'case', (u'in', u'IN')), ((u'sleep', u'NN'), u'nmod:poss', (u'my', u'PRP$'))]
However, I want the tokens numbered in the output, just as they are here:
nsubj(shot-2, I-1)
root(ROOT-0, shot-2)
det(elephant-4, an-3)
dobj(shot-2, elephant-4)
case(sleep-7, in-5)
nmod:poss(sleep-7, my-6)
nmod(shot-2, sleep-7)
Here is my code so far:
from nltk.parse.stanford import StanfordDependencyParser
stanford_parser_dir = 'stanford-parser/'
eng_model_path = stanford_parser_dir + "stanford-parser-models/edu/stanford/nlp/models/lexparser/englishRNN.ser.gz"
my_path_to_models_jar = stanford_parser_dir + "stanford-parser-3.5.2-models.jar"
my_path_to_jar = stanford_parser_dir + "stanford-parser.jar"
dependency_parser = StanfordDependencyParser(path_to_jar=my_path_to_jar, path_to_models_jar=my_path_to_models_jar)
result = dependency_parser.raw_parse('I shot an elephant in my sleep')
dep = result.next()
a = list(dep.triples())
print a
How can I get such an output?
Write a recursive function that traverses your tree. As a first pass, just try assigning the numbers to the words.
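
Alternatively, the token indices are already stored on the DependencyGraph itself: triples() discards them, but each entry in dep.nodes keeps its 'address' (the 1-based token position) along with 'word', 'rel', and 'head'. A sketch assuming that NLTK DependencyGraph node layout:

for address, node in sorted(dep.nodes.items()):
    if node['word'] is None:
        continue  # skip the artificial ROOT node at address 0
    head = dep.nodes[node['head']]
    head_word = head['word'] if head['word'] is not None else 'ROOT'
    print('%s(%s-%d, %s-%d)' % (node['rel'], head_word, node['head'], node['word'], address))

This prints each relation in the rel(head-i, dependent-j) form shown above, including the root(ROOT-0, shot-2) line.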

How to extract chunks from BIO chunked sentences? - python

Given an input sentence that has BIO chunk tags:
[('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
I would need to extract the relevant phrases, e.g. if I want to extract 'NP', I would need to extract the fragments of tuples that contain B-NP and I-NP.
[out]:
[('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]
(Note: the numbers in the extracted tuples represent the token indices.)
I have tried extracting it using the following code:
def extract_chunks(tagged_sent, chunk_type):
    current_chunk = []
    current_chunk_position = []
    for idx, word_pos in enumerate(tagged_sent):
        word, pos = word_pos
        if '-' + chunk_type in pos:  # Append the word to the current_chunk.
            current_chunk.append(word)
            current_chunk_position.append(idx)
        else:
            if current_chunk:  # Flush the full chunk when out of an NP.
                _chunk_str = ' '.join(current_chunk)
                _chunk_pos_str = '-'.join(map(str, current_chunk_position))
                yield _chunk_str, _chunk_pos_str
                current_chunk = []
                current_chunk_position = []
    if current_chunk:  # Flush the last chunk (positions are ints, so map to str).
        yield ' '.join(current_chunk), '-'.join(map(str, current_chunk_position))
tagged_sent = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
print (list(extract_chunks(tagged_sent, chunk_type='NP')))
But when I have adjacent chunks of the same type:
tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')]
print (list(extract_chunks(tagged_sent, chunk_type='NP')))
It outputs this:
[('The Mitsubishi Electric Company Managing Director', '0-1-2-3-4-5'), ('ramen', '7')]
Instead of the desired:
[('The Mitsubishi Electric Company', '0-1-2-3'), ('Managing Director', '4-5'), ('ramen', '7')]
How can this be resolved in the code above?
Alternatively, other than the approach above, is there a better solution to extract the desired chunks of a specific chunk_type?
Try this; it extracts the chunks of the requested type together with the indices of their respective words.
def extract_chunks(tagged_sent, chunk_type='NP'):
    out_sen = []
    for idx, word_pos in enumerate(tagged_sent):
        word, bio = word_pos
        boundary, tag = bio.split("-") if "-" in bio else ('', 'O')
        if tag != chunk_type:
            continue
        if boundary == "B":
            out_sen.append([word, str(idx)])
        elif boundary == "I":
            out_sen[-1][0] += " " + word
            out_sen[-1][-1] += "-" + str(idx)
        else:
            out_sen.append([word, str(idx)])
    return out_sen
Demo:
>>> tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')]
>>> output_sent = extract_chunks(tagged_sent)
>>> print map(tuple, output_sent)
[('The Mitsubishi Electric Company', '0-1-2-3'), ('Managing Director', '4-5'), ('ramen', '7')]
Another generator-based variant flushes the running group whenever a new B- tag of the requested type starts:
def extract_chunks(tagged_sent, chunk_type):
    grp1, grp2, chunk_type = [], [], "-" + chunk_type
    for ind, (s, tp) in enumerate(tagged_sent):
        if tp.endswith(chunk_type):
            if not tp.startswith("B"):
                grp2.append(str(ind))
                grp1.append(s)
            else:
                if grp1:
                    yield " ".join(grp1), "-".join(grp2)
                grp1, grp2 = [s], [str(ind)]
    yield " ".join(grp1), "-".join(grp2)
Output:
In [2]: l = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'),
...: ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')]
In [3]: list(extract_chunks(l, "NP"))
Out[3]:
[('The Mitsubishi Electric Company', '0-1-2-3'),
('Managing Director', '4-5'),
('ramen', '7')]
In [4]: l = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
In [5]: list(extract_chunks(l, "NP"))
Out[5]: [('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]
I would do it like this:
import re

def extract_chunks(tagged_sent, chunk_type):
    # compiles the expression we want to match
    regex = re.compile(chunk_type)
    # filters matched items into a dictionary whose keys are the matched indexes
    first_step = {index_: tag[0] for index_, tag in enumerate(tagged_sent) if regex.findall(tag[1])}
    # builds a list of lists following the output format
    second_step = []
    for key_ in sorted(first_step.keys()):
        if second_step and int(second_step[-1][1].split('-')[-1]) == key_ - 1:
            second_step[-1][0] += ' {0}'.format(first_step[key_])
            second_step[-1][1] += '-{0}'.format(key_)
        else:
            second_step.append([first_step[key_], str(key_)])
    # builds output in the final format
    return [tuple(item) for item in second_step]
You can adapt it to use generators instead of building the whole output in memory like I am doing, and refactor it for better performance (I'm in a hurry, so the code is far from optimal). Note that it only checks index adjacency, not B- boundaries, so adjacent chunks of the same type will still be merged.
Hope it helps!
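
Other than hand-rolled loops, NLTK's own BIO utilities can do the grouping. conlltags2tree() expects (word, pos, chunktag) triples, so one can fake a POS column and then walk the resulting tree while counting leaf positions; by default it only builds subtrees for NP, PP, and VP chunks. A minimal sketch along those lines (extract_chunks_nltk is my name, not NLTK's):

from nltk import Tree
from nltk.chunk import conlltags2tree

def extract_chunks_nltk(tagged_sent, chunk_type='NP'):
    # fake a POS column, since our input is (word, chunktag) pairs
    tree = conlltags2tree([(w, 'X', c) for w, c in tagged_sent])
    idx = 0
    for node in tree:
        if isinstance(node, Tree):  # a chunk subtree, one leaf per token
            if node.label() == chunk_type:
                span = range(idx, idx + len(node))
                yield ' '.join(w for w, _ in node.leaves()), '-'.join(map(str, span))
            idx += len(node)
        else:  # an O token sits directly under the root
            idx += 1

This yields the same pairs as the accepted approach for both example sentences, since conlltags2tree() starts a new subtree at every B- tag.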
