I am using the Stanford dependency parser and I get the following output for the sentence
I shot an elephant in my sleep
>>>python dep_parsing.py
[((u'shot', u'VBD'), u'nsubj', (u'I', u'PRP')), ((u'shot', u'VBD'), u'dobj', (u'elephant', u'NN')), ((u'elephant', u'NN'), u'det', (u'an', u'DT')), ((u'shot', u'VBD'), u'nmod', (u'sleep', u'NN')), ((u'sleep', u'NN'), u'case', (u'in', u'IN')), ((u'sleep', u'NN'), u'nmod:poss', (u'my', u'PRP$'))]
However, I want numbered tokens in the output, just as shown here:
nsubj(shot-2, I-1)
root(ROOT-0, shot-2)
det(elephant-4, an-3)
dobj(shot-2, elephant-4)
case(sleep-7, in-5)
nmod:poss(sleep-7, my-6)
nmod(shot-2, sleep-7)
Here is my code so far.
from nltk.parse.stanford import StanfordDependencyParser
stanford_parser_dir = 'stanford-parser/'
eng_model_path = stanford_parser_dir + "stanford-parser-models/edu/stanford/nlp/models/lexparser/englishRNN.ser.gz"
my_path_to_models_jar = stanford_parser_dir + "stanford-parser-3.5.2-models.jar"
my_path_to_jar = stanford_parser_dir + "stanford-parser.jar"
dependency_parser = StanfordDependencyParser(path_to_jar=my_path_to_jar, path_to_models_jar=my_path_to_models_jar)
result = dependency_parser.raw_parse('I shot an elephant in my sleep')
dep = result.next()
a = list(dep.triples())
print a
How can I get output in that format?
Write a recursive function that traverses your tree. As a first pass, just try assigning the numbers to the words.
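Alternatively, the DependencyGraph returned by raw_parse() already stores a 1-based address for every token, so you may not need to traverse the tree yourself. Here is a minimal sketch of my own (not the poster's code), assuming NLTK's DependencyGraph.nodes layout, where address 0 is the artificial ROOT node:
# Sketch: print numbered dependencies from the DependencyGraph `dep`.
# dep.nodes maps each token's 1-based address to a dict with
# 'word', 'rel' and 'head' keys; address 0 is the artificial ROOT.
for address, node in sorted(dep.nodes.items()):
    if node['word'] is None:  # skip the ROOT placeholder itself
        continue
    head = dep.nodes[node['head']]
    head_word = head['word'] if head['word'] is not None else 'ROOT'
    print('%s(%s-%d, %s-%d)' % (node['rel'], head_word,
                                node['head'], node['word'], address))
For the example sentence this should produce lines like nsubj(shot-2, I-1) and nmod(shot-2, sleep-7).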
I am trying to apply a regex grammar to a list of POS tags in Python in order to find the tense of a sentence, and I wrote the following code to implement it.
Data preprocessing:
from nltk import word_tokenize, pos_tag
import nltk
text = "He will have been doing his homework."
tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)
tags = []
for i in range(len(tagged)):
    t = tagged[i]
    tags.append(t[1])
print(tags)
The regex grammar to be applied:
grammar = r"""
Future_Perfect_Continuous: {<MD><VB><VBN><VBG>}
Future_Continuous: {<MD><VB><VBG>}
Future_Perfect: {<MD><VB><VBN>}
Past_Perfect_Continuous: {<VBD><VBN><VBG>}
Present_Perfect_Continuous:{<VBP|VBZ><VBN><VBG>}
Future_Indefinite: {<MD><VB>}
Past_Continuous: {<VBD><VBG>}
Past_Perfect: {<VBD><VBN>}
Present_Continuous: {<VBZ|VBP><VBG>}
Present_Perfect: {<VBZ|VBP><VBN>}
Past_Indefinite: {<VBD>}
Present_Indefinite: {<VBZ>|<VBP>}"""
Function to apply the grammar to the tags list:
def check_grammar(grammar, tags):
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(tags)
    print(result)
    result.draw()
check_grammar(grammar, tags)
But it raised the following error:
Traceback (most recent call last):
  File "/home/samar/Desktop/twitter_tense/main.py", line 35, in <module>
    check_grammar(grammar, tags)
  File "/home/samar/Desktop/twitter_tense/main.py", line 31, in check_grammar
    result = cp.parse(tags)
  File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 1276, in parse
    chunk_struct = parser.parse(chunk_struct, trace=trace)
  File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 1083, in parse
    chunkstr = ChunkString(chunk_struct)
  File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 95, in __init__
    tags = [self._tag(tok) for tok in self._pieces]
  File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 95, in <listcomp>
    tags = [self._tag(tok) for tok in self._pieces]
  File "/home/samar/.local/lib/python3.8/site-packages/nltk/chunk/regexp.py", line 105, in _tag
    raise ValueError("chunk structures must contain tagged " "tokens or trees")
ValueError: chunk structures must contain tagged tokens or trees
Your call to the cp.parse() function expects each of the tokens in your sentence to be tagged; however, the tags list you created contains only the tags, not the tokens as well, hence your ValueError. The solution is to instead pass the output of the pos_tag() call (i.e. tagged) to your check_grammar call (see below).
Solution
from nltk import word_tokenize, pos_tag
import nltk
text = "He will have been doing his homework."
tokenized = word_tokenize(text)
tagged = pos_tag(tokenized)
print(tagged)
# Output
>>> [('He', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('been', 'VBN'), ('doing', 'VBG'), ('his', 'PRP$'), ('homework', 'NN'), ('.', '.')]
my_grammar = r"""
Future_Perfect_Continuous: {<MD><VB><VBN><VBG>}
Future_Continuous: {<MD><VB><VBG>}
Future_Perfect: {<MD><VB><VBN>}
Past_Perfect_Continuous: {<VBD><VBN><VBG>}
Present_Perfect_Continuous:{<VBP|VBZ><VBN><VBG>}
Future_Indefinite: {<MD><VB>}
Past_Continuous: {<VBD><VBG>}
Past_Perfect: {<VBD><VBN>}
Present_Continuous: {<VBZ|VBP><VBG>}
Present_Perfect: {<VBZ|VBP><VBN>}
Past_Indefinite: {<VBD>}
Present_Indefinite: {<VBZ>|<VBP>}"""
def check_grammar(grammar, tags):
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(tags)
    print(result)
    result.draw()
check_grammar(my_grammar, tagged)
Output
>>> (S
>>> He/PRP
>>> (Future_Perfect_Continuous will/MD have/VB been/VBN doing/VBG)
>>> his/PRP$
>>> homework/NN
>>> ./.)
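If you need the detected tense programmatically rather than just printed, here is a small follow-up sketch (my own addition, assuming check_grammar is changed to return the parse result): every non-root subtree label is a matched tense pattern.
def check_grammar(grammar, tags):
    cp = nltk.RegexpParser(grammar)
    return cp.parse(tags)

tree = check_grammar(my_grammar, tagged)
# collect the labels of all matched chunks (the root is labelled 'S')
tenses = [st.label() for st in tree.subtrees() if st.label() != 'S']
print(tenses)  # ['Future_Perfect_Continuous']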
I have an array of strings in different languages and I would like to remove stop words from these strings.
Example strings:
["mai fostul președinte egiptean mohamed morsi ", "em bon jovi lançou o álbum have a nice day a ", " otok škulj är en ö i kroatien den ligger i län"...]
This is the list of languages I want to handle:
['French',
'Spanish',
'Thai',
'Russian',
'Persian',
'Indonesian',
'Arabic',
'Pushto',
'Kannada',
'Danish',
'Japanese',
'Malayalam',
'Latin',
'Romanian',
'Swedish',
'Portugese',
'English',
'Turkish',
'Tamil',
'Urdu',
'Korean',
'German',
'Greek',
'Italian',
'Chinese',
'Dutch',
'Estonian',
'Hindi']
I am using the spaCy library, but I'm looking for something that supports multiple languages.
What I have tried so far:
import pandas as pd
import nltk
nltk.download('punkt')
import spacy
nlp = spacy.load("xx_ent_wiki_sm")
from spacy.tokenizer import Tokenizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
doc = nlp("This is a sentence about Facebook.")
print([(ent.text, ent.label) for ent in doc.ents])
all_stopwords = nlp.Defaults.stop_words
data_text=df1['Text'] #here where i store my strings
for x in data_text:
    text_tokens = word_tokenize(x)
    tokens_without_sw = [word for word in text_tokens if word not in all_stopwords]
    print(tokens_without_sw)
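Since the post already imports nltk.corpus.stopwords, one option is to build a single stopword set spanning several languages. A minimal sketch of my own, assuming NLTK's stopword corpus, which covers a number of the listed languages (French, Spanish, Russian, Danish, German, Italian, Dutch, and Turkish, among others):
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# union the stopword lists of the languages NLTK ships that we care about
langs = ['french', 'spanish', 'russian', 'danish', 'german',
         'italian', 'dutch', 'turkish', 'english']
all_stopwords = set()
for lang in langs:
    all_stopwords.update(stopwords.words(lang))
The filtering loop above can then be used unchanged with this larger set.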
The pretty print of the nltk.Tree class prints in the following format:
print spacy2tree(nlp(u'Williams is a defensive coach'))
(S
(SUBJ Williams/NNP)
(PRED is/VBZ test/VBN)
a/DT
defensive/JJ
coach/NN)
As a Tree:
spacy2tree(nlp(u'Williams is a defensive coach'))
Tree('S', [Tree('SUBJ', [(u'Williams', u'NNP')]),
Tree('PRED', [(u'is', u'VBZ'), ('test', 'VBN')]), (u'a', u'DT'), (u'defensive', u'JJ'), (u'coach', u'NN')])
but it doesn't ingest it correctly:
tfs = spacy2tree(nlp(u'Williams is a defensive coach')).pformat()
Tree.fromstring(tfs)
Tree('S', [Tree('SUBJ', ['Williams/NNP']),
Tree('PRED', ['is/VBZ', 'test/VBN']), 'a/DT', 'defensive/JJ', 'coach/NN'])
Example:
correct incorrect
('SUBJ', [(u'Williams', u'NNP')]) =vs=> ('SUBJ', ['Williams/NNP'])
('PRED', [(u'is', u'VBZ'), ('test', 'VBN')]) =vs=> ('PRED', ['is/VBZ', 'test/VBN'])
Is there a utility to ingest a Tree from a string correctly?
It seems I figured it out: passing a read_leaf callback that splits each leaf on '/' restores the (word, tag) tuples:
Tree.fromstring(tfs, read_leaf=lambda s: tuple(s.split('/')))
Tree('S', [Tree('SUBJ', [(u'Williams', u'NNP')]),
Tree('PRED', [(u'is', u'VBZ'), (u'test', u'VBN')]), (u'a', u'DT'), (u'defensive', u'JJ'), (u'coach', u'NN')])
So now this works correctly too:
tree2conlltags(Tree.fromstring(tfs, read_leaf=lambda s: tuple(s.split('/'))))
[(u'Williams', u'NNP', u'B-SUBJ'),
(u'is', u'VBZ', u'B-PRED'),
(u'test', u'VBN', u'I-PRED'),
(u'a', u'DT', u'O'),
(u'defensive', u'JJ', u'O'),
(u'coach', u'NN', u'O')]
I have been working on a cryptography question. Here is my first attempt, using an OrderedDict.
from collections import OrderedDict
alphabet = OrderedDict(
[(u'a', u'ç'), (u'b', u'd'), (u'c', u'e'), (u'ç', u'f'), (u'd', u'g'), (u'e', u'g'), (u'f', u'h'),
(u'g', u'i'), (u'g', u'i'), (u'h', u'j'), (u'i', u'k'), (u'i', u'l'), (u'j', u'm'), (u'k', u'n'),
(u'l', u'o'), (u'm', u'ö'), (u'n', u'p'), (u'o', u'r'), (u'ö', u's'), (u'p', u'ş'), (u'r', u't'),
(u's', u'u'), (u'ş', u'ü'), (u't', u'v'), (u'u', u'y'), (u'ü', u'z'), (u'v', u'a'),
(u'y', u'b'), (u'z',u'c'), (9, '')])
text = 'öğtjçdç9dğp9grnvrt9jyuğbkp9içokoğr9pyp9nçtkbğt9iypoğtk9gycğpoğgkikpk9gybgyö9ağ9uğpkp9kekp9drboğ9dkt9urty9\
jçcktoçgkö9içokoğr9gç9uvçmç9dçuoçöçn9kekp9vğn9bçşöçp9iğtğnğp9\
çpçjvçtk9dyogyiyp9ağ9dy9öğvkp9grubçukpk9rnybçdkogkikp9\
çoirtkvöçoçtk9kuvğgkikp9gkogğ9nrgoçöçp9ağ9dçuoçtkpç9çpçjvçt9\
ağ9ngvççmnçç9bçcçtçn9nrgoçtkpk9uktnğvkp9kphr9çgğtğukpğ9öçko9çvöçpgkt9\jçgk9irtğbkö9uğpk'
out = []
for k, v in alphabet.items():
    for i in list(text):
        if i == v:
            out.append(alphabet[v])
outlast = (''.join(out))
print(outlast)
I can't compare the characters of text with the alphabet values. I want to compare every element i of text against the values, and when i equals a value, append the corresponding key to the out list. I'm using PyCharm 2017.2.2. Can you help me?
The cryptography key is 3: each letter is shifted three positions up the Turkish alphabet.
This is the wrong output I get:
ffffffffffffffffffffffffffffffffffffffffffffgggggggggggiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiijllllllllllmmmmmmööpppppppppppprrrrrrrrrrrrrrrrrrsssssssssssşşşşşşşşşşşşşşşşşşşşşşşşşşşştttttttttttttüvvvvvvvvvvvvvvvvvvvvyyyyyyyyyyyyaaaaaaaaaaabbbbbbbbbbbbççççdddddddddeee
Expected Output:
merhaba ben doktor huseyin galileo nun kariyer gunleri duzenledigini duydum ve senin icin boyle bir soru hazirladim galileo da staja baslamak icin tek yapman gereken anahtari buldugun ve bu metin dosyasini okuyabildigin algoritmalari istedigin dilde kodlaman ve baslarina anahtar ve kdtaajkaa yazarak kodlarini sirketin info aderesine mail atmandir hadi goreyim seni
Not sure why you need an OrderedDict; a list of tuples works well. Here is how your code may look:
alphabet = [(u'a', u'ç'), (u'b', u'd'), (u'c', u'e'), (u'ç', u'f'), (u'd', u'g'), (u'e', u'g'), (u'f', u'h'),
(u'g', u'i'), (u'g', u'i'), (u'h', u'j'), (u'i', u'k'), (u'i', u'l'), (u'j', u'm'), (u'k', u'n'),
(u'l', u'o'), (u'm', u'ö'), (u'n', u'p'), (u'o', u'r'), (u'ö', u's'), (u'p', u'ş'), (u'r', u't'),
(u's', u'u'), (u'ş', u'ü'), (u't', u'v'), (u'u', u'y'), (u'ü', u'z'), (u'v', u'a'),
(u'y', u'b'), (u'z',u'c'),
(' ', '9')] #'9' should be quoted and be the second element of the tuple
text = 'öğtjçdç9dğp9grnvrt9jyuğbkp9içokoğr9pyp9nçtkbğt9iypoğtk9gycğpoğgkikpk9gybgyö9ağ9uğpkp9kekp9drboğ9dkt9urty9\
jçcktoçgkö9içokoğr9gç9uvçmç9dçuoçöçn9kekp9vğn9bçşöçp9iğtğnğp9\
çpçjvçtk9dyogyiyp9ağ9dy9öğvkp9grubçukpk9rnybçdkogkikp9\
çoirtkvöçoçtk9kuvğgkikp9gkogğ9nrgoçöçp9ağ9dçuoçtkpç9çpçjvçt9\
ağ9ngvççmnçç9bçcçtçn9nrgoçtkpk9uktnğvkp9kphr9çgğtğukpğ9öçko9çvöçpgkt9\jçgk9irtğbkö9uğpk'
out = []
for t in list(text):    # outer loop is over the text
    for v in alphabet:  # and inner loop is to search for the char in the alphabet
        if t == v[1]:
            out += v[0]
outlast = (''.join(out))
print(outlast)
Output:
mrhaba bn deoktor husyin ggalilo nun kariyr ggunlri deuznldeiggini deuydeum v snin icin boyl bir soru hazirladeim ggalilo dea staja baslamak icin tk yapman ggrkn anahtari buldeuggun v bu mtin deosyasini okuyabildeiggin alggoritmalari istdeiggin deilde kodelaman v baslarina anahtar v kdetaajkaa yazarak kodelarini sirktin info adersin mail atmandeir hadei ggoryim sni
There are still some errors in your alphabet definition; I leave fixing them to you.
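As a variation (my own sketch, not part of the original answer), the same decoding can be done with a single reverse-lookup dict, avoiding the nested loops entirely:
# invert the substitution table once: cipher char -> plain char
decode = {c: p for p, c in alphabet}
outlast = ''.join(decode.get(ch, ch) for ch in text)
print(outlast)
Characters without an entry in the table pass through unchanged.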
I am an R user who is beginning to work in Python. I am trying to use knitr to knit a Python file and capture a tree diagram, but it is not working. Here is the .Rnw (LaTeX-based) file I am trying to knit:
\documentclass{article}
\begin{document}
Hello world!
<<r test-python1, engine='python'>>=
x = 'hello, python world!'
print(x)
@
<<r test-python2, engine='python', echo=FALSE>>=
import nltk
from nltk.tree import *
from nltk.draw import tree
grammar = r"""
NP:
{<.*>+} # Chunk everything
}<VBD|IN>+{ # Chink sequences of VBD and IN
"""
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
result
@
<<python>>=
x = 'hello, python world!'
print(x.split(' '))
@
\end{document}
But all that is returned is an error; it appears that nltk can't be found, yet I can run it in Spyder just fine and plot the tree diagram. What do I need to do to include this diagram in the .Rnw file for output to a PDF? The x.split error indicates Python is found, but I'm not importing correctly in knitr.
I am using Python 3.4.3 64 bit for Windows 7.
I'm not sure knitr supports Python graphics; see: knitr: python engine output not in .md or .html
If not, here is a way to solve it: draw the tree with NLTK's CanvasFrame, write it to a PostScript file, convert that to PNG with ImageMagick from an R chunk, and include the PNG in the LaTeX output:
\documentclass{article}
\begin{document}
Hello world!
<<r test-python1, engine='python'>>=
x = 'hello, python world!'
print(x)
@
<<r test-python2, engine='python', echo=FALSE>>=
import nltk
from nltk import Tree
from nltk.draw.util import CanvasFrame
from nltk.draw import TreeWidget
dp1 = Tree('dp', [Tree('d', ['the']), Tree('np', ['dog'])])
dp2 = Tree('dp', [Tree('d', ['the']), Tree('np', ['cat'])])
vp = Tree('vp', [Tree('v', ['chased']), dp2])
sentence = Tree('s', [dp1, vp])
cf = CanvasFrame()
tc = TreeWidget(cf.canvas(),sentence)
cf.add_widget(tc,10,10) # (10,10) offsets
cf.print_to_file('cache/tree.ps')
cf.destroy()
@
<<r r-code>>=
fls <- file.path(getwd(), c('cache/tree.ps', 'cache/tree.png'))
system(sprintf('"C:/Program Files/ImageMagick-6.9.0-Q16/convert.exe" %s %s', fls[1], fls[2]))
@
\begin{figure}[!ht]
\centering
\includegraphics[scale=.71]{cache/tree.png}
\caption{yay!} \label{regexviz}
\end{figure}
<<r test-python3, engine='python'>>=
x = 'hello, python world!'
print(x.split(' '))
@
\end{document}