Context-free grammar tree as a feature - Python

Is there a way to use the output of StanfordParser as a feature for my models?
The output of StanfordParser looks like this:
Sentence: 'hello how are you?'
Result - [Tree('ROOT', [Tree('SBARQ', [Tree('WHADVP', [Tree('RB', ['hello']), Tree('WRB', ['how'])]), Tree('SQ', [Tree('VBP', ['are']), Tree('NP', [Tree('PRP', ['you'])])]), Tree('.', ['?'])])])]
Result[0] - (ROOT (SBARQ
(WHADVP (RB hello) (WRB how))
(SQ (VBP are) (NP (PRP you)))
(. ?)))
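
One common way to turn a parse tree into model features (a sketch, not from the original post; the bag-of-rules idea and the use of collections.Counter are my assumptions) is to flatten the tree into its production rules and count them:

# Minimal sketch: bag-of-production-rules features from an nltk.Tree.
from collections import Counter
from nltk import Tree

# Rebuild the tree shown above from its bracketed string form.
tree = Tree.fromstring(
    "(ROOT (SBARQ (WHADVP (RB hello) (WRB how)) "
    "(SQ (VBP are) (NP (PRP you))) (. ?)))")

# Each production such as "SQ -> VBP NP" becomes one feature name;
# the resulting dict of counts can be vectorized for any model.
features = Counter(str(p) for p in tree.productions())
print(features)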

Related

An error in implementing regex function on a list

I was trying to apply a regular expression to a list of POS tags in Python to find the tense of the sentence. I wrote the following code to implement it.
Data preprocessing:
from nltk import word_tokenize, pos_tag
import nltk
text = "He will have been doing his homework."
tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)
tags = []
for i in range(len(tagged)):
    t = tagged[i]
    tags.append(t[1])
print(tags)
The regex grammar to be applied:
grammar = r"""
Future_Perfect_Continuous: {<MD><VB><VBN><VBG>}
Future_Continuous: {<MD><VB><VBG>}
Future_Perfect: {<MD><VB><VBN>}
Past_Perfect_Continuous: {<VBD><VBN><VBG>}
Present_Perfect_Continuous:{<VBP|VBZ><VBN><VBG>}
Future_Indefinite: {<MD><VB>}
Past_Continuous: {<VBD><VBG>}
Past_Perfect: {<VBD><VBN>}
Present_Continuous: {<VBZ|VBP><VBG>}
Present_Perfect: {<VBZ|VBP><VBN>}
Past_Indefinite: {<VBD>}
Present_Indefinite: {<VBZ>|<VBP>}"""
Function to apply the grammar to the tags list:
def check_grammar(grammar, tags):
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(tags)
    print(result)
    result.draw()

check_grammar(grammar, tags)
But it raised the following error:
Traceback (most recent call last):
File "/home/samar/Desktop/twitter_tense/main.py", line 35, in <module>
check_grammar(grammar, tags)
File "/home/samar/Desktop/twitter_tense/main.py", line 31, in check_grammar
result = cp.parse(tags)
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 1276, in parse
chunk_struct = parser.parse(chunk_struct, trace=trace)
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 1083, in parse
chunkstr = ChunkString(chunk_struct)
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 95, in __init__
tags = [self._tag(tok) for tok in self._pieces]
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 95, in <listcomp>
tags = [self._tag(tok) for tok in self._pieces]
File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 105, in _tag
raise ValueError("chunk structures must contain tagged " "tokens or trees")
ValueError: chunk structures must contain tagged tokens or trees
Your call to cp.parse() expects every token in the sentence to be tagged; however, the tags list you created contains only the tags, not the tokens as well, hence the ValueError. The solution is to instead pass the output of the pos_tag() call (i.e. tagged) to your check_grammar call (see below).
Solution
from nltk import word_tokenize, pos_tag
import nltk
text = "He will have been doing his homework."
tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)
print(tagged)
# Output
>>> [('He', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('been', 'VBN'), ('doing', 'VBG'), ('his', 'PRP$'), ('homework', 'NN'), ('.', '.')]
my_grammar = r"""
Future_Perfect_Continuous: {<MD><VB><VBN><VBG>}
Future_Continuous: {<MD><VB><VBG>}
Future_Perfect: {<MD><VB><VBN>}
Past_Perfect_Continuous: {<VBD><VBN><VBG>}
Present_Perfect_Continuous:{<VBP|VBZ><VBN><VBG>}
Future_Indefinite: {<MD><VB>}
Past_Continuous: {<VBD><VBG>}
Past_Perfect: {<VBD><VBN>}
Present_Continuous: {<VBZ|VBP><VBG>}
Present_Perfect: {<VBZ|VBP><VBN>}
Past_Indefinite: {<VBD>}
Present_Indefinite: {<VBZ>|<VBP>}"""
def check_grammar(grammar, tags):
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(tags)
    print(result)
    result.draw()
check_grammar(my_grammar, tagged)
Output
>>> (S
>>> He/PRP
>>> (Future_Perfect_Continuous will/MD have/VB been/VBN doing/VBG)
>>> his/PRP$
>>> homework/NN
>>> ./.)
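
If you want the detected tense as a plain value instead of a drawn tree, you can read the labels of the matched chunks off the result. A small sketch under the same setup (the helper name detect_tenses is mine):

def detect_tenses(result):
    # Every matched chunk becomes a subtree whose label is the tense name;
    # the root of a RegexpParser result is labelled 'S', so skip it.
    return [st.label() for st in result.subtrees() if st.label() != 'S']

cp = nltk.RegexpParser(my_grammar)
print(detect_tenses(cp.parse(tagged)))
# ['Future_Perfect_Continuous']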

Removing stop words from strings using spaCy in different languages

I have an array of strings in different languages and I would like to remove stop words from these strings.
Example strings:
["mai fostul președinte egiptean mohamed morsi ", "em bon jovi lançou o álbum have a nice day a ", " otok škulj är en ö i kroatien den ligger i län"...]
This is the list of languages I want to support:
['French',
'Spanish',
'Thai',
'Russian',
'Persian',
'Indonesian',
'Arabic',
'Pushto',
'Kannada',
'Danish',
'Japanese',
'Malayalam',
'Latin',
'Romanian',
'Swedish',
'Portugese',
'English',
'Turkish',
'Tamil',
'Urdu',
'Korean',
'German',
'Greek',
'Italian',
'Chinese',
'Dutch',
'Estonian',
'Hindi']
I am using the spaCy library, but I'm looking for something that supports multiple languages.
What I have tried already:
import pandas as pd
import nltk
nltk.download('punkt')
import spacy
nlp = spacy.load("xx_ent_wiki_sm")
from spacy.tokenizer import Tokenizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
doc = nlp("This is a sentence about Facebook.")
print([(ent.text, ent.label_) for ent in doc.ents])
all_stopwords = nlp.Defaults.stop_words
data_text = df1['Text']  # this is where I store my strings
for x in data_text:
    text_tokens = word_tokenize(x)
    tokens_without_sw = [word for word in text_tokens if word not in all_stopwords]
    print(tokens_without_sw)
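
Since spaCy ships stop word lists with its per-language data, one way to cover many languages without loading full pipelines is to collect the stop word sets straight from the language classes. A sketch, with assumptions: the name-to-code mapping below is mine and incomplete, and some languages in the list above (e.g. Latin, Pushto) may have no stop word data in spaCy at all:

import spacy.util
from nltk.tokenize import word_tokenize

# Partial, hypothetical mapping from the language names above to spaCy codes.
LANG_CODES = {"French": "fr", "Spanish": "es", "Russian": "ru", "Danish": "da",
              "Portugese": "pt", "English": "en", "Turkish": "tr",
              "German": "de", "Greek": "el", "Italian": "it", "Dutch": "nl"}

def build_stopword_set(lang_codes):
    all_sw = set()
    for code in lang_codes.values():
        try:
            # Defaults.stop_words is the language's bundled stop word set.
            all_sw |= spacy.util.get_lang_class(code).Defaults.stop_words
        except Exception:
            pass  # language data not available in this spaCy install
    return all_sw

all_stopwords = build_stopword_set(LANG_CODES)
text = "mai fostul președinte egiptean mohamed morsi"
print([w for w in word_tokenize(text) if w.lower() not in all_stopwords])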

NLTK : From string to Tree with "slash-tokens" word/POS?

The pretty print of the nltk.Tree class uses the following format:
print spacy2tree(nlp(u'Williams is a defensive coach'))
(S
(SUBJ Williams/NNP)
(PRED is/VBZ test/VBN)
a/DT
defensive/JJ
coach/NN)
As a Tree:
spacy2tree(nlp(u'Williams is a defensive coach'))
Tree('S', [Tree('SUBJ', [(u'Williams', u'NNP')]),
Tree('PRED', [(u'is', u'VBZ'), ('test', 'VBN')]), (u'a', u'DT'), (u'defensive', u'JJ'), (u'coach', u'NN')])
but Tree.fromstring doesn't ingest it correctly:
tfs = spacy2tree(nlp(u'Williams is a defensive coach') ).pformat()
Tree.fromstring(tfs)
Tree('S', [Tree('SUBJ', ['Williams/NNP']),
Tree('PRED', ['is/VBZ', 'test/VBN']), 'a/DT', 'defensive/JJ', 'coach/NN'])
Example (correct =vs=> incorrect):
('SUBJ', [(u'Williams', u'NNP')]) =vs=> ('SUBJ', ['Williams/NNP'])
('PRED', [(u'is', u'VBZ'), ('test', 'VBN')]) =vs=> ('PRED', ['is/VBZ', 'test/VBN'])
Is there a utility to ingest a Tree from a string correctly?
It seems I figured it out:
Tree.fromstring(tfs, read_leaf=lambda s: tuple(s.split('/')))
Tree('S', [Tree('SUBJ', [(u'Williams', u'NNP')]),
    Tree('PRED', [(u'is', u'VBZ'), (u'test', u'VBN')]), (u'a', u'DT'), (u'defensive', u'JJ'), (u'coach', u'NN')])
So now this works correctly too:
tree2conlltags(Tree.fromstring(tfs, read_leaf=lambda s: tuple(s.split('/'))))
[(u'Williams', u'NNP', u'B-SUBJ'),
(u'is', u'VBZ', u'B-PRED'),
(u'test', u'VBN', u'I-PRED'),
(u'a', u'DT', u'O'),
(u'defensive', u'JJ', u'O'),
(u'coach', u'NN', u'O')]
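
One caveat to the read_leaf trick (my addition, not from the original answer): a token that itself contains a slash, e.g. a date like 3/4, would be split into more than two parts. Splitting only on the last slash is safer when the POS tag is the final field:

Tree.fromstring(tfs, read_leaf=lambda s: tuple(s.rsplit('/', 1)))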

knitr + python plot tree diagram

I am an R user who is beginning to work in Python. I am trying to use knitr to knit a Python file and capture a tree diagram, but it is not working. Here is the .Rnw (LaTeX-based) file I am trying to knit:
\documentclass{article}
\begin{document}
Hello world!
<<r test-python1, engine='python'>>=
x = 'hello, python world!'
print(x)
@
<<r test-python2, engine='python', echo=FALSE>>=
import nltk
from nltk.tree import *
from nltk.draw import tree
grammar = r"""
NP:
{<.*>+} # Chunk everything
}<VBD|IN>+{ # Chink sequences of VBD and IN
"""
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
result
@
<<python>>=
x = 'hello, python world!'
print(x.split(' '))
@
\end{document}
But all that is returned is the plain text output, with no tree diagram. It appears that nltk can't be found, yet I can run it in Spyder just fine and plot the tree diagram. What do I need to do to include this diagram in the .Rnw file for output to a PDF? The x.split chunk shows that Python is found, but the nltk import doesn't seem to work in knitr.
I am using Python 3.4.3 64-bit on Windows 7.
I'm not sure knitr handles Python graphics (see: knitr: python engine output not in .md or .html). If not, here's a way to solve it:
\documentclass{article}
\begin{document}
Hello world!
<<r test-python1, engine='python'>>=
x = 'hello, python world!'
print(x)
@
<<r test-python2, engine='python', echo=FALSE>>=
import nltk
from nltk import Tree
from nltk.draw.util import CanvasFrame
from nltk.draw import TreeWidget
dp1 = Tree('dp', [Tree('d', ['the']), Tree('np', ['dog'])])
dp2 = Tree('dp', [Tree('d', ['the']), Tree('np', ['cat'])])
vp = Tree('vp', [Tree('v', ['chased']), dp2])
sentence = Tree('s', [dp1, vp])
cf = CanvasFrame()
tc = TreeWidget(cf.canvas(),sentence)
cf.add_widget(tc,10,10) # (10,10) offsets
cf.print_to_file('cache/tree.ps')
cf.destroy()
@
<<r r-code>>=
fls <- file.path(getwd(), c('cache/tree.ps', 'cache/tree.png'))
system(sprintf('"C:/Program Files/ImageMagick-6.9.0-Q16/convert.exe" %s %s', fls[1], fls[2]))
@
\begin{figure}[!ht]
\centering
\includegraphics[scale=.71]{cache/tree.png}
\caption{yay!} \label{regexviz}
\end{figure}
<<r test-python3, engine='python'>>=
x = 'hello, python world!'
print(x.split(' '))
@
\end{document}
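
One practical note on the answer above (my addition): CanvasFrame.print_to_file does not create missing directories, so writing cache/tree.ps fails unless cache/ already exists. Creating it at the top of the Python chunk avoids that:

import os
if not os.path.exists('cache'):
    os.makedirs('cache')  # print_to_file() won't create the directory itself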

How to use Stanford Parser to parse Chinese texts correctly in Python?

I am employing the Stanford Parser to parse Chinese texts, and I want to extract the context-free grammar production rules from them.
I set up my environment as described in Stanford Parser and NLTK.
My code is below:
from nltk.parse import stanford
parser = stanford.StanfordParser(
    path_to_jar='/home/stanford-parser-full-2013-11-12/stanford-parser.jar',
    path_to_models_jar='/home/stanford-parser-full-2013-11-12/stanford-parser-3.3.0-models.jar',
    model_path="/home/stanford-parser-full-2013-11-12/chinesePCFG.ser.gz", encoding='utf8')
text = '我 对 这个 游戏 有 一 点 上瘾。'
sentences = parser.raw_parse_sents(unicode(text, encoding='utf8'))
However, when I try to
print sentences
I get
[Tree('ROOT', [Tree('IP', [Tree('NP', [Tree('PN', ['\u6211'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VA', ['\u5bf9'])])])]), Tree('ROOT', [Tree('IP', [Tree('NP', [Tree('PN', ['\u8fd9'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('QP', [Tree('CLP', [Tree('M', ['\u4e2a'])])])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VV', ['\u6e38'])])])]), Tree('ROOT', [Tree('FRAG', [Tree('NP', [Tree('NN', ['\u620f'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VE', ['\u6709'])])])]), Tree('ROOT', [Tree('FRAG', [Tree('QP', [Tree('CD', ['\u4e00'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VV', ['\u70b9'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VV', ['\u4e0a'])])])]), Tree('ROOT', [Tree('FRAG', [Tree('NP', [Tree('NN', ['\u763e'])])])]), Tree('ROOT', [Tree('IP', [Tree('NP', [Tree('PU', ['\u3002'])])])])]
in which the Chinese words are split apart from each other. The sentence has 9 tokens, but 12 one-word trees are returned. Could anyone show me what the problem is?
Continuing, I try to collect all context-free grammar production rules from it:
lst = []
for subtree in sentences:
    for production in subtree.productions():
        lst.append(production)
print lst
[ROOT -> IP, IP -> NP, NP -> PN, PN -> '\u6211', ROOT -> IP, IP -> VP, VP -> VA, VA -> '\u5bf9', ROOT -> IP, IP -> NP, NP -> PN, PN -> '\u8fd9', ROOT -> IP, IP -> VP, VP -> QP, QP -> CLP, CLP -> M, M -> '\u4e2a', ROOT -> IP, IP -> VP, VP -> VV, VV -> '\u6e38', ROOT -> FRAG, FRAG -> NP, NP -> NN, NN -> '\u620f', ROOT -> IP, IP -> VP, VP -> VE, VE -> '\u6709', ROOT -> FRAG, FRAG -> QP, QP -> CD, CD -> '\u4e00', ROOT -> IP, IP -> VP, VP -> VV, VV -> '\u70b9', ROOT -> IP, IP -> VP, VP -> VV, VV -> '\u4e0a', ROOT -> FRAG, FRAG -> NP, NP -> NN, NN -> '\u763e', ROOT -> IP, IP -> NP, NP -> PU, PU -> '\u3002']
But the Chinese words are still split into single characters.
Since I do not have much knowledge of Java, I have to use the Python interface for my task. I really need help from the Stack Overflow community. Could anyone help me with it?
I have found the solution:
Using parser.raw_parse instead of parser.raw_parse_sents solves the problem: raw_parse_sents expects a list of sentences, so passing it a single string makes it treat each character as a separate sentence.
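For completeness, a minimal sketch of the fix under the same setup (my addition; note that depending on the NLTK version, raw_parse returns either a single Tree or an iterator of trees):

# raw_parse takes one sentence string, so the parser now sees the whole
# pre-segmented sentence instead of one character at a time.
result = parser.raw_parse(unicode(text, encoding='utf8'))
lst = []
for tree in result:
    for production in tree.productions():
        lst.append(production)
print lst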
