Taking reference from "Why doesn't [01-12] range work as expected?", I tried:
m = re.search(r"(\w+)\[([0-9]+)\:([0-9]+)\]", DUNESX[01:44])
or
m = re.search(r"(\w+)\[(0[1-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9])\:([0-9]+)\]", DUNESX[01:44])
or
m = re.search(r"(\w+)\[(0?[1-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9])\:([0-9]+)\]", DUNESX[01:44])
or
m = re.search(r"(\w+)\[([0-1][0-9]+)\:([0-9]+)\]", DUNESX[01:44])
But the output from the above expressions is
['DUNESX1', 'DUNESX2', 'DUNESX3', 'DUNESX4', 'DUNESX5', 'DUNESX6', 'DUNESX7', 'DUNESX8', 'DUNESX9', 'DUNESX10', 'DUNESX11', 'DUNESX12', 'DUNESX13', 'DUNESX14', 'DUNESX15', 'DUNESX16', 'DUNESX17', 'DUNESX18', 'DUNESX19', 'DUNESX20', 'DUNESX21', 'DUNESX22', 'DUNESX23', 'DUNESX24', 'DUNESX25', 'DUNESX26', 'DUNESX27', 'DUNESX28', 'DUNESX29', 'DUNESX30', 'DUNESX31', 'DUNESX32', 'DUNESX33', 'DUNESX34', 'DUNESX35', 'DUNESX36', 'DUNESX37', 'DUNESX38', 'DUNESX39', 'DUNESX40', 'DUNESX41', 'DUNESX42', 'DUNESX43', 'DUNESX44']
It doesn't produce the desired output:
['DUNESX01', 'DUNESX02', 'DUNESX03', 'DUNESX04', 'DUNESX05', 'DUNESX06', 'DUNESX07', 'DUNESX08', 'DUNESX09', 'DUNESX10', 'DUNESX11', 'DUNESX12', 'DUNESX13', 'DUNESX14', 'DUNESX15', 'DUNESX16', 'DUNESX17', 'DUNESX18', 'DUNESX19', 'DUNESX20', 'DUNESX21', 'DUNESX22', 'DUNESX23', 'DUNESX24', 'DUNESX25', 'DUNESX26', 'DUNESX27', 'DUNESX28', 'DUNESX29', 'DUNESX30', 'DUNESX31', 'DUNESX32', 'DUNESX33', 'DUNESX34', 'DUNESX35', 'DUNESX36', 'DUNESX37', 'DUNESX38', 'DUNESX39', 'DUNESX40', 'DUNESX41', 'DUNESX42', 'DUNESX43', 'DUNESX44']
The complete code is:
import re

def getgrandchild(child):
    nodelist = []
    if child is None:
        return
    for nodes in child:
        print(nodes)
        if re.match(r".*(\[[0-1][0-9]+\:[0-9]+\])", nodes):
            m = re.search(r"(\w+)\[([0-9]+)\:([0-9]+)\]", nodes)
            lb = int(m.group(2))
            ub = int(m.group(3))
            for i in range(lb, ub+1):
                nodelist.append(m.group(1)+str(i))
        elif re.match(r"(\w+)", nodes):
            m = re.search(r"(\w+)", nodes)
            nodelist.append(m.group(1))
    return nodelist

group_list = ['DUNESX[01:44]']
host_list = getgrandchild(group_list)
I think I understand what you are trying to do. Here is code that helps:
def getgrandchild(child):
    nodelist = []
    for nodes in child:
        m = re.search(r"(\w+)\[([0-9]+)\:([0-9]+)\]", nodes)
        for i in range(int(m.group(2)), int(m.group(3))+1):
            nodelist.append(m.group(1) + str(i).zfill(len(m.group(2))))
    return nodelist
You can see I've skipped some steps, but you can add them back; I've focused on the main part.
We use zfill to pad the numbers into forms like '001' or '01', which is explained here.
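For instance (a quick illustration of the padding, not part of the original code):

str(5).zfill(2)   # '05'
str(5).zfill(3)   # '005'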
So, for this code, if you give:
getgrandchild(['DUNESX[01:44]'])
you get:
['DUNESX01', 'DUNESX02', 'DUNESX03', 'DUNESX04', 'DUNESX05', 'DUNESX06', 'DUNESX07', 'DUNESX08', 'DUNESX09', 'DUNESX10', 'DUNESX11', 'DUNESX12', 'DUNESX13', 'DUNESX14', 'DUNESX15', 'DUNESX16', 'DUNESX17', 'DUNESX18', 'DUNESX19', 'DUNESX20', 'DUNESX21', 'DUNESX22', 'DUNESX23', 'DUNESX24', 'DUNESX25', 'DUNESX26', 'DUNESX27', 'DUNESX28', 'DUNESX29', 'DUNESX30', 'DUNESX31', 'DUNESX32', 'DUNESX33', 'DUNESX34', 'DUNESX35', 'DUNESX36', 'DUNESX37', 'DUNESX38', 'DUNESX39', 'DUNESX40', 'DUNESX41', 'DUNESX42', 'DUNESX43', 'DUNESX44']
Also, if you give:
getgrandchild(['PYTHON[001:025]'])
you get:
['PYTHON001', 'PYTHON002', 'PYTHON003', 'PYTHON004', 'PYTHON005', 'PYTHON006', 'PYTHON007', 'PYTHON008', 'PYTHON009', 'PYTHON010', 'PYTHON011', 'PYTHON012', 'PYTHON013', 'PYTHON014', 'PYTHON015', 'PYTHON016', 'PYTHON017', 'PYTHON018', 'PYTHON019', 'PYTHON020', 'PYTHON021', 'PYTHON022', 'PYTHON023', 'PYTHON024', 'PYTHON025']
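As an aside (my addition, not part of the original answer), on Python 3.6+ the same padding can be written with an f-string, taking the width from the lower bound just as the zfill call does:

width = len(m.group(2))
nodelist.append(m.group(1) + f"{i:0{width}d}")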
Last time I got some help making a website name generator. I feel bad, but I'm stuck at the moment and I need some help again to improve it. In my code there's a .txt file called combined which includes these lines.
After that I created variables to add to the domain:
web = 'web'
suffix = 'co.id'
And then I write it out so that it prints each line of output to Combined.txt:
output_count = 50
subdomain_count = 2

for i in range(output_count):
    out = []
    for j in range(subdomain_count):
        out.append(random.choice(Test))
    out.append(web)
    out.append(suffix)
    Example.write('.'.join(out)+"\n")

with open("dictionaries/examples.txt") as f:
    websamples = [line.rstrip() for line in f]
I want output where, instead of just login.download.web.co.id, there would be more variety, like login-download.web.co.id or login.download-web.co.id. In the code I used Example.write('.'.join(out)+"\n") so that the . would be the separator between the parts. I was thinking of adding more variants by writing similar code lines and saving them to different .txt files, but I feel like that would be too long. Is there a way I can vary the separator between the parts with - or _ instead of just a . in the output?
Thanks!
Sure, just iterate through a list of delimiters to add each of them to the output.
web = 'web'
suffix = 'co.id'
output_count = 50
subdomain_count = 2
delimiters = ['-', '.']

for i in range(output_count):
    out = []
    for j in range(subdomain_count):
        out.append(random.choice(Test))
    for delimiter in delimiters:
        addr = delimiter.join(out)
        addrs = '.'.join([addr, web, suffix])
        print(addrs)
        Example.write(addrs + '\n')
Output:
my_pay.web.co.id
my-pay.web.co.id
my.pay.web.co.id
pay_download.web.co.id
pay-download.web.co.id
pay.download.web.co.id
group_login.web.co.id
group-login.web.co.id
group.login.web.co.id
install_group.web.co.id
install-group.web.co.id
install.group.web.co.id
...
...
Update:
import itertools

Test = ['download', 'login', 'my', 'ip', 'site', 'ssl', 'pay', 'install']
delimiters = ['-', '.']
web = 'web'
suffix = 'co.id'
output_count = 50
subdomain_count = 2

for combo in itertools.combinations(Test, 2):
    out = ''
    for i, d in enumerate(delimiters):
        out = d.join(combo)
        out = delimiters[i-1].join([out, web])
        addr = '.'.join([out, suffix])
        print(addr)
        # Example.write(addr+'\n')
Output:
download-login.web.co.id
download.login-web.co.id
download-my.web.co.id
download.my-web.co.id
download-ip.web.co.id
download.ip-web.co.id
download-site.web.co.id
download.site-web.co.id
download-ssl.web.co.id
download.ssl-web.co.id
download-pay.web.co.id
download.pay-web.co.id
download-install.web.co.id
download.install-web.co.id
login-my.web.co.id
login.my-web.co.id
login-ip.web.co.id
login.ip-web.co.id
login-site.web.co.id
login.site-web.co.id
login-ssl.web.co.id
login.ssl-web.co.id
login-pay.web.co.id
login.pay-web.co.id
login-install.web.co.id
login.install-web.co.id
my-ip.web.co.id
my.ip-web.co.id
my-site.web.co.id
my.site-web.co.id
my-ssl.web.co.id
my.ssl-web.co.id
my-pay.web.co.id
my.pay-web.co.id
my-install.web.co.id
my.install-web.co.id
ip-site.web.co.id
ip.site-web.co.id
ip-ssl.web.co.id
ip.ssl-web.co.id
ip-pay.web.co.id
ip.pay-web.co.id
ip-install.web.co.id
ip.install-web.co.id
site-ssl.web.co.id
site.ssl-web.co.id
site-pay.web.co.id
site.pay-web.co.id
site-install.web.co.id
site.install-web.co.id
ssl-pay.web.co.id
ssl.pay-web.co.id
ssl-install.web.co.id
ssl.install-web.co.id
pay-install.web.co.id
pay.install-web.co.id
As an alternative to replacing the final output, you could make the separator random:
import random

separators = ['-', '_', '.']
Example.write(random.choice(separators).join(out)+"\n")
In order to ensure compliance with RFC 1035 I would suggest:
from random import choices as CHOICES, choice as CHOICE

output_count = 50
subdomain_count = 2
web = 'web'
suffix = 'co.id'
dotdash = '.-'
filename = 'output.txt'

Test = [
    'auth',
    'access',
    'account',
    'admin'
    # etc
]

with open(filename, 'w') as output:
    for _ in range(output_count):
        sd = CHOICE(dotdash).join(CHOICES(Test, k=subdomain_count))
        print('.'.join((sd, web, suffix)), file=output)
I would like to transform some pretty simple affirmative sentences into general questions (the language of choice is Spanish). Consider the following example:
Esto es muy difícil. -> Es esto muy difícil?
So I just need to shift the position of subject and predicate (wherever they are).
Normally it can be done with the shift_before_node() method:
pron_node, aux_node = tree.descendants[0], tree.descendants[1]
aux_node.shift_before_node(pron_node)
However, if I want to automate the process (because the subject and predicate will not always be in the same position), I need to create a cycle (see The Problem paragraph below) over each node of a tree, which checks that if a node's part of speech (upos) is PRON or PROPN, and it is followed (not necessarily directly) by a node which is a VERB or AUX, the second node must be shifted before the first one (like in the example above). But I don't know how to implement this as a cycle. Any suggestions?
Here is my code so far (done in Google Colab). I apologize for excluding some of the console text; otherwise it would be too lengthy:
Request to UDPipe server
import requests
response = requests.get("http://lindat.mff.cuni.cz/services/udpipe/api/models")
info = response.json()
info
for key, data in info["models"].items():
    if "spanish" in key:
        print(key, data)
params = {"tokenizer": "", "tagger": "", "parser": "", "model": "spanish-gsd-ud-2.6-200830"}
text = "Esto es muy difícil."
params["data"] = text
response = requests.get("http://lindat.mff.cuni.cz/services/udpipe/api/process", params)
json_response = response.json()
parse = json_response["result"]
print(parse)
Output #1 (print (parse)):
# generator = UDPipe 2, https://lindat.mff.cuni.cz/services/udpipe
# udpipe_model = spanish-gsd-ud-2.6-200830
# udpipe_model_licence = CC BY-NC-SA
# newdoc
# newpar
# sent_id = 1
# text = Esto es muy difícil.
1 Esto este PRON _ Number=Sing|PronType=Dem 4 nsubj _ _
2 es ser AUX _ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop _ _
3 muy mucho ADV _ _ 4 advmod _ _
4 difícil difícil ADJ _ Number=Sing 0 root _ SpaceAfter=No
5 . . PUNCT _ _ 4 punct _ SpaceAfter=No
Udapi Installation:
!pip install --upgrade git+https://github.com/udapi/udapi-python.git
import os
os.environ['PATH'] += ":" + os.path.join(os.environ['HOME'], ".local/bin")
from udapi.block.read.conllu import Conllu
from udapi.core.document import Document
from udapi.block.write.textmodetrees import TextModeTrees
from io import StringIO
Building a tree:
In my understanding, a tree is a variable of a built-in Udapi class, a structured version of the parse variable, which contains all the information about each word of a sentence: its order (ord), given form (form), initial form (lemma), part of speech (upos) and so on:
tree = Conllu(filehandle=StringIO(parse)).read_tree()
writer = TextModeTrees(attributes="ord,form,lemma,upos,feats,deprel", layout="align")
writer.process_tree(tree)
Output #2 (writer.process_tree(tree)):
# sent_id = 1
# text = Esto es muy difícil.
─┮
│ ╭─╼ 1 Esto este PRON Number=Sing|PronType=Dem nsubj
│ ┢─╼ 2 es ser AUX Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin cop
│ ┢─╼ 3 muy mucho ADV _ advmod
╰─┾ 4 difícil difícil ADJ Number=Sing root
╰─╼ 5 . . PUNCT _ punct
It is also possible to print out all the dependents for each node of a given tree. As already correctly noted in the comments, tree.descendants consists of a list of nodes:
for node in tree.descendants:
    print(f"{node.ord}:{node.form}")
    left_children = node.children(preceding_only=True)
    if len(left_children) > 0:
        print("Left dependents:", end=" ")
        for child in left_children:
            print(f"{child.ord}:{child.form}", end=" ")
        print("")
    right_children = node.children(following_only=True)
    if len(right_children) > 0:
        print("Right dependents:", end=" ")
        for child in right_children:
            print(f"{child.ord}:{child.form}", end=" ")
        print("")
Output #3:
1:Esto
2:es
3:muy
4:difícil
Left dependents: 1:Esto 2:es 3:muy
Right dependents: 5:.
5:.
The problem (beginning of a cycle):
for node in tree.descendants:
    if node.upos == "VERB" or node.upos == "AUX":
UPDATE 1
So, I've come to the first somewhat complete version of the needed cycle, and now it looks like this:
for i, curr_node in enumerate(nodes[1:], 1):
    prev_node = nodes[i-1]
    if (prev_node.upos == "PRON" or prev_node.upos == "PROPN") and (curr_node.upos == "VERB" or curr_node.upos == "AUX"):
        curr_node.shift_before_node(prev_node)
But now I get this error:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-11-a967bbd730fe> in <module>()
9
10
---> 11 for i, curr_node in enumerate(nodes[1:], 1):
12 prev_node = nodes[i-1]
13 if (prev_node.upos == "PRON" or prev_node.upos == "PROPN") and (curr_node.upos == "VERB" or curr_node.upos == "AUX"):
NameError: name 'nodes' is not defined
UPDATE 2
I tried defining nodes like that:
nodes = tree.descendants
And now my cycle at least runs, but it still didn't do anything to the structure of the given sentence:
nodes = tree.descendants
for i, curr_node in enumerate(nodes[1:], 1):
    prev_node = nodes[i-1]
    if (prev_node.upos == "PRON" or prev_node.upos == "PROPN") and (curr_node.upos == "VERB" or curr_node.upos == "AUX"):
        curr_node.shift_before_node(prev_node)
Checking the tree:
tree = Conllu(filehandle=StringIO(parse)).read_tree()
writer = TextModeTrees(attributes="ord,form,lemma,upos,feats,deprel", layout="align")
writer.process_tree(tree)
# sent_id = 1
# text = Esto es muy difícil.
─┮
│ ╭─╼ 1 Esto este PRON Number=Sing|PronType=Dem nsubj
│ ┢─╼ 2 es ser AUX Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin cop
│ ┢─╼ 3 muy mucho ADV _ advmod
╰─┾ 4 difícil difícil ADJ Number=Sing root
╰─╼ 5 . . PUNCT _
Nothing changed.
UPDATE 3
I've also tried to check whether the cycle swaps subject and predicate back again (a second time), making the sentence look like the original one, but I guess that's not the case, because even with the break part commented out, flag increased by 1 only:
nodes = tree.descendants
flag = 1
for i, curr_node in enumerate(nodes[1:], 1):
    prev_node = nodes[i-1]
    if ((prev_node.upos == "PRON") or (prev_node.upos == "PROPN")) and ((curr_node.upos == "VERB") or (curr_node.upos == "AUX")):
        curr_node.shift_before_node(prev_node)
        flag = flag + 1
        # if flag == 2:
        #     break
print(flag)
Output
2
HOWEVER, it means that the condition if ((prev_node.upos == "PRON") or (prev_node.upos == "PROPN")) and ((curr_node.upos == "VERB") or (curr_node.upos == "AUX")) was satisfied.
Suppose there is one sentence per line in affirm.txt with affirmative Spanish sentences such as "Esto es muy difícil." or "Tus padres compraron esa casa de la que me hablaste.".
As an alternative to using the UDPipe web service, we can parse the sentences locally (I slightly prefer the es_ancora model over es_gsd):
import udapi
doc = udapi.Document('affirm.txt')
udapi.create_block('udpipe.Base', model_alias='es_ancora').apply_on_document(doc)
To make repeated experiments faster, we can now store the parsed trees to a CoNLL-U file using doc.store_conllu('affirm.conllu') and later load it using doc = udapi.Document('affirm.conllu').
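For instance, that caching step is just:

doc.store_conllu('affirm.conllu')      # store the parsed trees once
doc = udapi.Document('affirm.conllu')  # reload them in later runs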
To draw the trees we can use the doc.draw() method (or even tree.draw()), which is a syntactic sugar that uses TextModeTrees() behind the scenes. So to compare the sentences before and after changing the word order, we can use:
print("Original word order:")
doc.draw() # or doc.draw(attributes="ord,form,lemma,deprel,feats,misc")
for tree in doc.trees:
process_tree(tree)
print("Question-like word order:")
doc.draw()
Now comes the main work: implementing the process_tree() subroutine. Note that:
We need to change the word order of the main clause only (e.g. "Tus padres compraron esa casa."), not any dependent clauses (e.g. "de la que me hablaste"). So we don't want to iterate over all nodes (tree.descendants); we just need to find the main predicate (usually a verb) and its subject.
The subject need not be only a PRON or PROPN; it can be a NOUN, or maybe just an ADJ if the governing noun is omitted. So it is safer to just ask for deprel=nsubj (handling csubj is beyond the scope of this question).
I don't speak Spanish, but I think the rule is not as simple as moving the verb before the subject (or moving the subject after the verb). At least, we need to distinguish transitive verbs (with objects) and copula constructions. Of course, even the solution below is not perfect; it is rather an example of how to use Udapi.
We should handle the nasty details like capitalization and spacing.
def process_tree(tree):
    # Find the main predicate and its subject
    main_predicate = tree.children[0]
    nsubj = next((n for n in main_predicate.children if n.udeprel == 'nsubj'), None)
    if not nsubj:
        return
    # Move the subject
    # - after the auxiliary copula verb if present
    # - or after the last object if present
    # - or after the main predicate (verb)
    cop = next((n for n in main_predicate.children if n.udeprel == 'cop'), None)
    if cop:
        nsubj.shift_after_subtree(cop)
    else:
        objects = [n for n in main_predicate.children if n.udeprel in ('obj', 'iobj')]
        if objects:
            nsubj.shift_after_subtree(objects[-1])
        else:
            nsubj.shift_after_node(main_predicate)
    # Fix the capitalization
    nsubj_start = nsubj.descendants(add_self=True)[0]
    if nsubj_start.lemma[0].islower() and nsubj_start.form[0].isupper():
        nsubj_start.form = nsubj_start.form.lower()
    tree.descendants[0].form = tree.descendants[0].form.capitalize()
    # Add a question mark (instead of a full stop)
    dots = [n for n in main_predicate.children if n.form == '.']
    if not dots:
        dots = [main_predicate.create_child(upos="PUNCT", deprel="punct")]
    dots[-1].form = '?'
    # Fix spacing
    dots[-1].prev_node.misc["SpaceAfter"] = "No"
    nsubj_start.prev_node.misc["SpaceAfter"] = ""
    # Recompute the string representation of the sentence
    tree.text = tree.compute_text()
The solution above uses Udapi as a library. An alternative would be to move the main code into a Udapi block called e.g. MakeQuestions:
from udapi.core.block import Block

class MakeQuestions(Block):
    def process_tree(self, tree):
        # the rest is the same as in the solution above
If we store this block in the current directory in file makequestions.py, we can call it from the command line in many ways:
# parse the affirmative sentences
cat affirm.txt | udapy -s \
read.Sentences \
udpipe.Base model_alias=es_ancora \
> affirm.conllu
# visualize the output with TextModeTrees (-T)
cat affirm.conllu | udapy -T .MakeQuestions | less -R
# store the output in CoNLL-U
udapy -s .MakeQuestions < affirm.conllu > questions.conllu
# show just the plain-text sentences
udapy write.Sentences < questions.conllu > questions.txt
# visualize the differences in HTML
udapy -H \
read.Conllu files=affirm.conllu zone=affirm \
read.Conllu files=questions.conllu zone=questions \
util.MarkDiff gold_zone=affirm attributes=form ignore_parent=1 \
> differences.html
For the adjective:
"The company's customer service was terrible."
{customer service, terrible}
For the verb:
"They kept increasing my phone bill"
{phone bill, increasing}
This is a branch question from this posting.
However, I'm trying to find adjectives and verbs corresponding to multi-token phrases/compound nouns such as "customer service" using spaCy.
I'm not sure how to do this with spaCy, NLTK, or any other prepackaged natural language processing software, and I'd appreciate any help!
For simple examples like this, you can use spaCy's dependency parsing with a few simple rules.
First, to identify multi-word nouns similar to the examples given, you can use the "compound" dependency. After parsing a document (e.g., a sentence) with spaCy, use a token's dep_ attribute to find its dependency.
For example, this sentence has two compound nouns:
"The compound dependency identifies compound nouns."
Each token and its dependency is shown below:
import spacy
import pandas as pd

nlp = spacy.load('en')  # note: newer spaCy versions use spacy.load('en_core_web_sm')
example_doc = nlp("The compound dependency identifies compound nouns.")
for tok in example_doc:
    print(tok.i, tok, "[", tok.dep_, "]")
>>>0 The [ det ]
>>>1 compound [ compound ]
>>>2 dependency [ nsubj ]
>>>3 identifies [ ROOT ]
>>>4 compound [ compound ]
>>>5 nouns [ dobj ]
>>>6 . [ punct ]
for tok in [tok for tok in example_doc if tok.dep_ == 'compound']:  # Get list of compounds in doc
    noun = example_doc[tok.i: tok.head.i + 1]
    print(noun)
>>>compound dependency
>>>compound nouns
The below function works for your examples. However, it will likely not work for more complicated sentences.
adj_doc = nlp("The company's customer service was terrible.")
verb_doc = nlp("They kept increasing my phone bill")
def get_compound_pairs(doc, verbose=False):
    """Return tuples of (multi-noun word, adjective or verb) for document."""
    compounds = [tok for tok in doc if tok.dep_ == 'compound']  # Get list of compounds in doc
    compounds = [c for c in compounds if c.i == 0 or doc[c.i - 1].dep_ != 'compound']  # Remove middle parts of compound nouns, but avoid index errors
    tuple_list = []
    if compounds:
        for tok in compounds:
            pair_item_1, pair_item_2 = (False, False)  # initialize false variables
            noun = doc[tok.i: tok.head.i + 1]
            pair_item_1 = noun
            # If noun is in the subject, we may be looking for adjective in predicate
            # In simple cases, this would mean that the noun shares a head with the adjective
            if noun.root.dep_ == 'nsubj':
                adj_list = [r for r in noun.root.head.rights if r.pos_ == 'ADJ']
                if adj_list:
                    pair_item_2 = adj_list[0]
                if verbose == True:  # For trying different dependency tree parsing rules
                    print("Noun: ", noun)
                    print("Noun root: ", noun.root)
                    print("Noun root head: ", noun.root.head)
                    print("Noun root head rights: ", [r for r in noun.root.head.rights if r.pos_ == 'ADJ'])
            if noun.root.dep_ == 'dobj':
                verb_ancestor_list = [a for a in noun.root.ancestors if a.pos_ == 'VERB']
                if verb_ancestor_list:
                    pair_item_2 = verb_ancestor_list[0]
                if verbose == True:  # For trying different dependency tree parsing rules
                    print("Noun: ", noun)
                    print("Noun root: ", noun.root)
                    print("Noun root head: ", noun.root.head)
                    print("Noun root head verb ancestors: ", [a for a in noun.root.ancestors if a.pos_ == 'VERB'])
            if pair_item_1 and pair_item_2:
                tuple_list.append((pair_item_1, pair_item_2))
    return tuple_list
get_compound_pairs(adj_doc)
>>>[(customer service, terrible)]
get_compound_pairs(verb_doc)
>>>[(phone bill, increasing)]
get_compound_pairs(example_doc, verbose=True)
>>>Noun: compound dependency
>>>Noun root: dependency
>>>Noun root head: identifies
>>>Noun root head rights: []
>>>Noun: compound nouns
>>>Noun root: nouns
>>>Noun root head: identifies
>>>Noun root head verb ancestors: [identifies]
>>>[(compound nouns, identifies)]
I needed to solve a similar problem and I wanted to share my solution as a spaCy custom component.
import spacy
from spacy.tokens import Token, Span
from spacy.language import Language

@Language.component("compound_chainer")
def find_compounds(doc):
    # force=True so repeated pipeline runs don't raise on re-registering the extension
    Token.set_extension("is_compound_chain", default=False, force=True)
    com_range = []
    max_ind = len(doc)
    for idx, tok in enumerate(doc):
        if (tok.dep_ == "compound") and (idx < max_ind):
            com_range.append([idx, idx+1])
    to_remove = []
    intersections = []
    for t1 in com_range:
        for t2 in com_range:
            if t1 != t2:
                s1 = set(t1)
                s2 = set(t2)
                if len(s1.intersection(s2)) > 0:
                    to_remove.append(t1)
                    to_remove.append(t2)
                    union = list(s1.union(s2))
                    if union not in intersections:
                        intersections.append(union)
    r = [t for t in com_range if t not in to_remove]
    compound_ranges = r + intersections
    spans = []
    for cr in compound_ranges:
        # Example cr: [[0, 1], [3, 4], [12, 13], [16, 17, 18]]
        entity = Span(doc, min(cr), max(cr)+1, label="compound_chain")
        for token in entity:
            token._.set("is_compound_chain", True)
        spans.append(entity)
    doc.ents = list(doc.ents) + spans
    return doc
Github link: https://github.com/eboraks/job-description-nlp-analysis/blob/main/src/components/compound_chainer.py
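For reference, a minimal usage sketch (my addition; it assumes spaCy v3 and an installed en_core_web_sm model):

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("compound_chainer", last=True)  # name registered by the @Language.component decorator
doc = nlp("The company's customer service was terrible.")
print([ent.text for ent in doc.ents if ent.label_ == "compound_chain"])
# should include "customer service" if the parser marks "customer" as a compound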
I am trying to parse an XML document that contains repeating child elements using Python. When I attempt to parse the data, it creates an empty file. If I comment out the code for the repeating child elements (see the bolded section in the Python script below), the document generates correctly. Can someone help?
XML:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<FRPerformance xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<FRPerformanceShareClassCurrency>
<FundCode>00190</FundCode>
<CurrencyID>USD</CurrencyID>
<FundShareClassCode>A</FundShareClassCode>
<ReportPeriodFrequency>Quarterly</ReportPeriodFrequency>
<ReportPeriodEndDate>06/30/2012</ReportPeriodEndDate>
<Net>
<Annualized>
<Year1>-4.909000000</Year1>
<Year3>10.140000000</Year3>
<Year5>-22.250000000</Year5>
<Year10>-7.570000000</Year10>
<Year15>-4.730000000</Year15>
<Year20>-0.900000000</Year20>
<SI>1.900000000</SI>
</Annualized>
</Net>
<Gross>
<Annualized>
<Month3>1.279000000</Month3>
<YTD>7.294000000</YTD>
<Year1>-0.167000000</Year1>
<Year3>11.940000000</Year3>
<Year5>-21.490000000</Year5>
<Year10>-7.120000000</Year10>
<Year15>-4.420000000</Year15>
<Year20>-0.660000000</Year20>
<SI>2.110000000</SI>
</Annualized>
<Cumulative>
<Month1Back>2.288000000</Month1Back>
<Month2Back>-1.587000000</Month2Back>
<Month3Back>0.610000000</Month3Back>
<CurrentYear>7.294000000</CurrentYear>
<Year1Back>-2.409000000</Year1Back>
<Year2Back>13.804000000</Year2Back>
<Year3Back>20.287000000</Year3Back>
<Year4Back>-78.528000000</Year4Back>
<Year5Back>-0.101000000</Year5Back>
<Year6Back>9.193000000</Year6Back>
<Year7Back>2.659000000</Year7Back>
<Year8Back>9.208000000</Year8Back>
<Year9Back>25.916000000</Year9Back>
<Year10Back>-3.612000000</Year10Back>
</Cumulative>
<HistoricReturns>
<HistoricReturns_Item>
<Date>Fri, 28 Feb 1997 00:00:00 -0600</Date>
<Return>32058.090000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 28 Feb 2003 00:00:00 -0600</Date>
<Return>36415.110000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 29 Feb 2008 00:00:00 -0600</Date>
<Return>49529.290000000</Return>
</HistoricReturns_Item>
<HistoricReturns_Item>
<Date>Fri, 30 Apr 1993 00:00:00 -0600</Date>
<Return>21621.500000000</Return>
</HistoricReturns_Item>
</HistoricReturns>
Python script
import sys
import xml.etree.ElementTree as ET

## Create command line arguments for XML file and tagName
xmlFile = sys.argv[1]
tagName = sys.argv[2]

tree = ET.parse(xmlFile)
root = tree.getroot()

## Setup the file for output
saveout = sys.stdout
output_file = open('parsedXML.csv', 'w')
sys.stdout = output_file

## Parse XML
for node in root.findall(tagName):
    fundCode = node.find('FundCode').text
    curr = node.find('CurrencyID').text
    shareClass = node.find('FundShareClassCode').text
    for node2 in node.findall('./Net/Annualized'):
        year1 = node2.findtext('Year1')
        year3 = node2.findtext('Year3')
        year5 = node2.findtext('Year5')
        year10 = node2.findtext('Year10')
        year15 = node2.findtext('Year15')
        year20 = node2.findtext('Year20')
        SI = node2.findtext('SI')
    for node3 in node.findall('./Gross'):
        for node4 in node3.findall('./Annualized'):
            month3 = node4.findtext('Month3')
            ytd = node4.findtext('YTD')
            year1g = node4.findtext('Year1')
            year3g = node4.findtext('Year3')
            year5g = node4.findtext('Year5')
            year10g = node4.findtext('Year10')
            year15g = node4.findtext('Year15')
            year20g = node4.findtext('Year20')
            SIg = node4.findtext('SI')
        for node5 in node3.findall('./Cumulative'):
            month1b = node5.findtext('Month1Back')
            month2b = node5.findtext('Month2Back')
            month3b = node5.findtext('Month3Back')
            curYear = node5.findtext('CurrentYear')
            year1b = node5.findtext('Year1Back')
            year2b = node5.findtext('Year2Back')
            year3b = node5.findtext('Year3Back')
            year4b = node5.findtext('Year4Back')
            year5b = node5.findtext('Year5Back')
            year6b = node5.findtext('Year6Back')
            year7b = node5.findtext('Year7Back')
            year8b = node5.findtext('Year8Back')
            year9b = node5.findtext('Year9Back')
            year10b = node5.findtext('Year10Back')
    **for node6 in node.findall('./HistoricReturns'):
        for node7 in node6.findall('./HistoricReturns_Item'):
            hDate = node7.findall('Date')
            hReturn = node7.findall('Return')**
    print(fundCode, curr, shareClass, year1, year3, year5, year10, year15, year20, SI, month3, ytd, year1g, year3g, year5g, year10g, year15g, year20g, SIg, month1b, month2b, month3b, curYear, year1b, year2b, year3b, year4b, year5b, year6b, year7b, year8b, year9b, year10b, hDate, hReturn)
The sample XML and the Python code don't match up in terms of structure. Either:
you're missing a closing </Gross> tag from the XML (which should be before the <HistoricReturns> section starts), in which case the code is correct, or
the code should be for node6 in node3.findall('./HistoricReturns'):, i.e. node3 instead of node, as sketched below.
N.B. The XML sample isn't complete (it isn't well-formed XML) because it's missing closing tags for Gross, FRPerformanceShareClassCurrency and FRPerformance, so this makes it impossible to answer the question definitively. Hope this helps though.
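For illustration, the second option would look like this (a sketch only; note also that findall returns lists of elements, so findtext is probably what you want if the goal is to print the text values):

for node6 in node3.findall('./HistoricReturns'):
    for node7 in node6.findall('./HistoricReturns_Item'):
        hDate = node7.findtext('Date')
        hReturn = node7.findtext('Return')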
I wanted to create a simple breadth first search algorithm, which returns the shortest path.
An actor information dictionary maps an actor to the list of movies the actor appears in:
actor_info = { "act1" : ["movieC", "movieA"], "act2" : ["movieA", "movieB"],
"act3" :["movieA", "movieB"], "act4" : ["movieC", "movieD"],
"act5" : ["movieD", "movieB"], "act6" : ["movieE"],
"act7" : ["movieG", "movieE"], "act8" : ["movieD", "movieF"],
"KevinBacon" : ["movieF"], "act10" : ["movieG"], "act11" : ["movieG"] }
The inverse of this maps movies to the list of actors appearing in them:
movie_info = {'movieB': ['act2', 'act3', 'act5'], 'movieC': ['act1', 'act4'],
'movieA': ['act1', 'act2', 'act3'], 'movieF': ['KevinBacon', 'act8'],
'movieG': ['act7', 'act10', 'act11'], 'movieD': ['act8', 'act4', 'act5'],
'movieE': ['act6', 'act7']}
so for a call
shortest_dictance("act1", "Kevin Bacon", actor_info, movie_info)
I should get 3 since act1 appears in movieC with Act4 who appears in movieD with Act8 who appears in movie F with KevinBacon. So the shortest distance is 3.
So far I have this:
def shortest_distance(actA, actB, actor_info, movie_info):
    '''Return the number of movies required to connect actA and actB.
    If there's no connection, return -1.'''
    # So we keep 2 lists of actors:
    # 1. The actors that we have already investigated.
    # 2. The actors that need to be investigated because we have found a
    #    connection beginning at actA. This list must be ordered, since we
    #    want to investigate actors in the order we discover them.
    #    -- Each time we put an actor in this list, we also store
    #       her distance from actA.
    investigated = []
    to_investigate = [actA]
    distance = 0
    while actB not in to_investigate and to_investigate != []:
        for actor in to_investigate:
            to_investigate.remove(actor)
            investigated.append(actor)
            for movie in actor_info[actor]:
                for co_star in movie_info[movie]:
                    if co_star not in investigated and co_star not in to_investigate:
                        to_investigate.append(co_star)
    ....
    ....
    return d
I can't figure out the appropriate way to keep track of the distances discovered at each iteration of the code. Also, the code seems to be very inefficient time-wise.
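For what it's worth, one common way to do that bookkeeping is to store each actor's distance alongside it in the queue; a minimal sketch, assuming the same actor_info/movie_info dictionaries as above:

from collections import deque

def shortest_distance(actA, actB, actor_info, movie_info):
    if actA == actB:
        return 0
    investigated = {actA}                # actors already discovered
    to_investigate = deque([(actA, 0)])  # (actor, distance from actA) pairs
    while to_investigate:
        actor, dist = to_investigate.popleft()
        for movie in actor_info[actor]:
            for co_star in movie_info[movie]:
                if co_star == actB:
                    return dist + 1      # one more movie connects them
                if co_star not in investigated:
                    investigated.add(co_star)
                    to_investigate.append((co_star, dist + 1))
    return -1                            # no connection

Using a set for investigated also addresses the efficiency worry, since membership tests become O(1).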
First, create one graph out of this to connect all the nodes, then run the shortest-path code (there could be a more efficient graph library to do this instead of the function mentioned below; nevertheless, this one is elegant), and then count the movie names along the shortest path.
for i in movie_info:
    actor_info[i] = movie_info[i]

def find_shortest_path(graph, start, end, path=[]):
    path = path + [start]
    if start == end:
        return path
    if not start in graph:
        return None
    shortest = None
    for node in graph[start]:
        if node not in path:
            newpath = find_shortest_path(graph, node, end, path)
            if newpath:
                if not shortest or len(newpath) < len(shortest):
                    shortest = newpath
    return shortest

L = find_shortest_path(actor_info, 'act1', 'act2')
print len([i for i in L if i in movie_info])
find_shortest_path Source: http://www.python.org/doc/essays/graphs/
This looks like it works. It keeps track of a current set of movies. For each step, it looks at all of the one-step-away movies which haven't already been considered ("seen").
actor_info = { "act1" : ["movieC", "movieA"], "act2" : ["movieA", "movieB"],
"act3" :["movieA", "movieB"], "act4" : ["movieC", "movieD"],
"act5" : ["movieD", "movieB"], "act6" : ["movieE"],
"act7" : ["movieG", "movieE"], "act8" : ["movieD", "movieF"],
"KevinBacon" : ["movieF"], "act10" : ["movieG"], "act11" : ["movieG"] }
movie_info = {'movieB': ['act2', 'act3', 'act5'], 'movieC': ['act1', 'act4'],
'movieA': ['act1', 'act2', 'act3'], 'movieF': ['KevinBacon', 'act8'],
'movieG': ['act7', 'act10', 'act11'], 'movieD': ['act8', 'act4', 'act5'],
'movieE': ['act6', 'act7']}
def shortest_distance(actA, actB, actor_info, movie_info):
    if actA not in actor_info:
        return -1  # "infinity"
    if actB not in actor_info:
        return -1  # "infinity"
    if actA == actB:
        return 0
    dist = 1
    movies = set(actor_info[actA])
    end_movies = set(actor_info[actB])
    if movies & end_movies:
        return dist
    seen = movies.copy()
    print "All movies with", actA, seen
    while 1:
        dist += 1
        next_step = set()
        for movie in movies:
            for actor in movie_info[movie]:
                next_step.update(actor_info[actor])
        print "Movies with actors from those movies", next_step
        movies = next_step - seen
        print "New movies with actors from those movies", movies
        if not movies:
            return -1  # "Infinity"
        # Has actorB been in any of those movies?
        if movies & end_movies:
            return dist
        # Update the set of seen movies, so I don't visit them again
        seen.update(movies)

if __name__ == "__main__":
    print shortest_distance("act1", "KevinBacon", actor_info, movie_info)
The output is
All movies with act1 set(['movieC', 'movieA'])
Movies with actors from those movies set(['movieB', 'movieC', 'movieA', 'movieD'])
New movies with actors from those movies set(['movieB', 'movieD'])
Movies with actors from those movies set(['movieB', 'movieC', 'movieA', 'movieF', 'movieD'])
New movies with actors from those movies set(['movieF'])
3
Here's a version which returns the list of movies making up the minimum connection (None for no connection, and an empty list if actA and actB are the same).
def connect(links, movie):
    chain = []
    while movie is not None:
        chain.append(movie)
        movie = links[movie]
    return chain

def shortest_distance(actA, actB, actor_info, movie_info):
    if actA not in actor_info:
        return None  # "infinity"
    if actB not in actor_info:
        return None  # "infinity"
    if actA == actB:
        return []
    # {x: y} means that x is one link outwards from y
    links = {}
    # Start from the destination and work backward
    for movie in actor_info[actB]:
        links[movie] = None
    dist = 1
    movies = links.keys()
    while 1:
        new_movies = []
        for movie in movies:
            for actor in movie_info[movie]:
                if actor == actA:
                    return connect(links, movie)
                for other_movie in actor_info[actor]:
                    if other_movie not in links:
                        links[other_movie] = movie
                        new_movies.append(other_movie)
        if not new_movies:
            return None  # Infinity
        movies = new_movies

if __name__ == "__main__":
    dist = shortest_distance("act1", "KevinBacon", actor_info, movie_info)
    if dist is None:
        print "Not connected"
    else:
        print "The Kevin Bacon Number for act1 is", len(dist)
        print "Movies are:", ", ".join(dist)
Here's the output:
The Kevin Bacon Number for act1 is 3
Movies are: movieC, movieD, movieF