Parsing complex Xml Python 3.4

I have the following XML:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<Suite>
<TestCase>
<TestCaseID>001</TestCaseID>
<TestCaseDescription>Hello</TestCaseDescription>
<TestSetup>
<Action>
<ActionCommand>gfdg</ActionCommand>
<TimeOut>dfgd</TimeOut>
<BamSymbol>gff</BamSymbol>
<Side>vfbgc</Side>
<PrimeBroker>fgfd</PrimeBroker>
<Size>fbcgc</Size>
<PMCode>fdgd</PMCode>
<Strategy>fdgf</Strategy>
<SubStrategy>fgf</SubStrategy>
<ActionLogEndPoint>fdgf</ActionLogEndPoint>
<IsActionResultLogged>fdgf</IsActionResultLogged>
<ValidationStep>
<IsValidated>fgdf</IsValidated>
<ValidationFormat>dfgf</ValidationFormat>
<ResponseEndpoint>gdf</ResponseEndpoint>
<ResponseParameterName>fdgfdg</ResponseParameterName>
<ResponseParameterValue>gff</ResponseParameterValue>
<ExpectedValue>fdgf</ExpectedValue>
<IsValidationResultLogged>gdfgf</IsValidationResultLogged>
<ValidationLogEndpoint>fdgf</ValidationLogEndpoint>
</ValidationStep>
</Action>
<Action>
<ActionCommand>New Order</ActionCommand>
<TimeOut>fdgf</TimeOut>
<BamSymbol>fdg</BamSymbol>
<Side>C(COVER)</Side>
<PrimeBroker>CSPB</PrimeBroker>
<Size>fdgd</Size>
<PMCode>GREE</PMCode>
<Strategy>Generalist</Strategy>
<SubStrategy>USLC</SubStrategy>
<ActionLogEndPoint>gfbhgf</ActionLogEndPoint>
<IsActionResultLogged>fdgf</IsActionResultLogged>
<ValidationStep>
<IsValidated>fdgd</IsValidated>
<ValidationFormat>dfgfd</ValidationFormat>
<ResponseEndpoint>dfgf</ResponseEndpoint>
<ResponseParameterName>fdgfd</ResponseParameterName>
<ResponseParameterValue>dfgf</ResponseParameterValue>
<ExpectedValue>fdg</ExpectedValue>
<IsValidationResultLogged>fdgdf</IsValidationResultLogged>
<ValidationLogEndpoint>fdgfd</ValidationLogEndpoint>
</ValidationStep>
</Action>
</TestSetup>
</TestCase>
</Suite>
Based on the value of ActionCommand I select one <Action> block. The problem is that I cannot get the nested parent tag (ValidationStep) and all its child tags. Can anyone help?
My code:
t2, v2, test_case = [], [], {}
for testSetup4 in root.findall("./TestCase/TestSetup/Action"):
    if testSetup4.find('ActionCommand').text == "gfdg":
        for c1 in testSetup4:
            t2.append(c1.tag)
            v2.append(c1.text)

        for k,v in zip(t2, v2):
            test_case[k] = v
I am not able to get ValidationStep (the nested parent) and its corresponding child tags.

Simply add another loop to iterate over the children of the <ValidationStep> node. Also, you do not need the two extra lists, because you can update the dictionary directly inside the parsing loop:
import xml.etree.ElementTree as et

dom = et.parse('Input.xml')
root = dom.getroot()

test_case = {}
for testSetup4 in root.findall("./TestCase/TestSetup/Action"):
    if testSetup4.find('ActionCommand').text == "gfdg":
        # direct children of <Action>
        for c1 in testSetup4:
            test_case[c1.tag] = c1.text
        # children of the nested <ValidationStep>
        for vd in testSetup4.findall("./ValidationStep/*"):
            test_case[vd.tag] = vd.text
Alternatively, use the double-slash operator (.//*) to match all descendants of the <Action> element, children and grandchildren alike:
for testSetup4 in root.findall("./TestCase/TestSetup/Action"):
    if testSetup4.find('ActionCommand').text == "gfdg":
        for c1 in testSetup4.findall(".//*"):
            test_case[c1.tag] = c1.text
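One caveat with flattening everything into a single dictionary: if a tag name inside <ValidationStep> ever matched a tag directly under <Action>, the later value would overwrite the earlier one. A minimal sketch of one way to keep them apart, assuming a trimmed-down copy of the question's XML and a hypothetical helper named action_to_dict, is to nest the <ValidationStep> children under their own key:

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample modelled on the question's XML (most tags omitted).
SAMPLE = """
<Suite>
  <TestCase>
    <TestSetup>
      <Action>
        <ActionCommand>gfdg</ActionCommand>
        <TimeOut>dfgd</TimeOut>
        <ValidationStep>
          <IsValidated>fgdf</IsValidated>
          <ExpectedValue>fdgf</ExpectedValue>
        </ValidationStep>
      </Action>
      <Action>
        <ActionCommand>New Order</ActionCommand>
        <TimeOut>fdgf</TimeOut>
        <ValidationStep>
          <IsValidated>fdgd</IsValidated>
          <ExpectedValue>fdg</ExpectedValue>
        </ValidationStep>
      </Action>
    </TestSetup>
  </TestCase>
</Suite>
"""


def action_to_dict(element):
    """Hypothetical helper: leaf tags map to text, parents to nested dicts."""
    result = {}
    for child in element:
        if len(child):  # element has children of its own, e.g. ValidationStep
            result[child.tag] = action_to_dict(child)
        else:
            result[child.tag] = child.text
    return result


root = ET.fromstring(SAMPLE)
test_case = {}
for action in root.findall("./TestCase/TestSetup/Action"):
    if action.findtext("ActionCommand") == "gfdg":
        test_case = action_to_dict(action)

print(test_case)
```

The nested values are then reachable as test_case["ValidationStep"]["IsValidated"] and so on.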

Related

How to add new nodes into XML tree, reading from a list in Python?

I am trying to read values from a list and add them as new nodes to an XML document in Python.
list = ['163','164','165']
After appending the list values to the trackingnumbers node, the XML should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<trackingrequest>
<user>TAIL</user>
<password>20</password>
<trackingnumbers>
<trackingnumber>163</trackingnumber>
<trackingnumber>164</trackingnumber>
<trackingnumber>165</trackingnumber>
</trackingnumbers>
</trackingrequest>
I have got this far, but I am stuck at creating dynamic variables inside a loop to create the new nodes inside trackingnumbers:
def GenerateXML():
    root = ET.Element("trackingrequest")
    m1 = ET.Element("user")
    root.append(m1)
    m1.text = 'TAIL'
    m2 = ET.Element("password")
    root.append(m2)
    m2.text = '20'
    m3 = ET.Element("trackingnumbers")
    root.append(m3)
    d = {}
    for i in range(list):
        d["trackingid_{0}".format(i)] = ET.SubElement(m3, "trackingnumber")
        d['trackingid_1'].text = i*2
    tree = ET.ElementTree(root)

The idea is to find the trackingnumbers element and add the required sub-elements to it:
import xml.etree.ElementTree as ET
lst = ['163', '164', '165']
xml = '''<?xml version="1.0" encoding="UTF-8"?>
<trackingrequest>
<user>TAIL</user>
<password>20</password>
<trackingnumbers>
</trackingnumbers>
</trackingrequest>'''
root = ET.fromstring(xml)
tracking_numbers = root.find('.//trackingnumbers')
for num in lst:
    tn = ET.SubElement(tracking_numbers, 'trackingnumber')
    tn.text = num
ET.dump(root)
Output (whitespace tidied for readability; ET.dump does not print the XML declaration):
<trackingrequest>
<user>TAIL</user>
<password>20</password>
<trackingnumbers>
<trackingnumber>163</trackingnumber>
<trackingnumber>164</trackingnumber>
<trackingnumber>165</trackingnumber>
</trackingnumbers>
</trackingrequest>
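If no template document exists yet, the same loop also works when the whole tree is built in code. A minimal sketch, assuming a hypothetical function name build_tracking_request and the user and password values from the question:

```python
import xml.etree.ElementTree as ET


def build_tracking_request(numbers, user="TAIL", password="20"):
    """Hypothetical helper: build <trackingrequest> with one child per number."""
    root = ET.Element("trackingrequest")
    ET.SubElement(root, "user").text = user
    ET.SubElement(root, "password").text = password
    tracking_numbers = ET.SubElement(root, "trackingnumbers")
    for num in numbers:
        # Element text must be a string, so convert in case ints are passed.
        ET.SubElement(tracking_numbers, "trackingnumber").text = str(num)
    return root


root = build_tracking_request(["163", "164", "165"])
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

No per-node variable names are needed here; each new element is reachable through its parent.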

Iterating over XML and selecting specific element tree content

I have an XML which looks like this:
<openie>
<triple confidence="1.000">
<subject begin="0" end="1">
<text>PAF</text>
<lemma>paf</lemma>
</subject>
<relation begin="1" end="2">
<text>gets</text>
<lemma>get</lemma>
</relation>
<object begin="2" end="6">
<text>name of web site</text>
<lemma>name of web site</lemma>
</object>
</triple>
<triple confidence="1.000">
<subject begin="0" end="1">
<text>PAF</text>
<lemma>paf</lemma>
</subject>
<relation begin="1" end="2">
<text>gets</text>
<lemma>get</lemma>
</relation>
<object begin="2" end="3">
<text>name</text>
<lemma>name</lemma>
</object>
</triple>
</openie>
The openie element is nested here: root > document > sentences > sentence > openie
In my function I am trying to print the triples, each of which contains subject, relation and object elements. Unfortunately I cannot get it to work, because I am unable to reach these three elements and their text elements. Which part is wrong?
def get_openie():
    print('OpenIE parser start...')
    tree = ET.parse('./tmp/nlp_output.xml')
    root = tree.getroot()
    for triple in root.findall('./document/sentences/sentence/openie/triple'):
        t_subject = triple.find('subject/text').text
        t_relation = triple.find('relation/text').text
        t_object = triple.get('object/text').text
        print(t_subject,t_relation,t_object)
Output for two triples should look like this:
PAF gets name of web site
PAF gets name

To get t_object you are calling triple.get() instead of triple.find(). Element.get() looks up an attribute of the element, so it returns None here and accessing .text on it fails. Changing the call to find() fixes your problem.
def get_openie():
    print('OpenIE parser start...')
    tree = ET.parse('./tmp/nlp_output.xml')
    root = tree.getroot()
    for triple in root.findall('./document/sentences/sentence/openie/triple'):
        t_subject = triple.find('subject/text').text
        t_relation = triple.find('relation/text').text
        t_object = triple.find('object/text').text
        print(t_subject,t_relation,t_object)
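The difference between the two methods can be seen on a single triple. This is a small sketch using a cut-down copy of the question's XML; findtext() with a default is an optional extra, assumed here to be useful when an element may be missing (the adverb path is made up to show that case):

```python
import xml.etree.ElementTree as ET

triple = ET.fromstring("""
<triple confidence="1.000">
  <subject begin="0" end="1"><text>PAF</text></subject>
  <relation begin="1" end="2"><text>gets</text></relation>
  <object begin="2" end="3"><text>name</text></object>
</triple>
""")

# get() reads an attribute of the element itself.
confidence = triple.get("confidence")
# There is no attribute called "object/text", so get() returns None.
missing = triple.get("object/text")
# findtext() follows a path to a child element and returns its text,
# or the given default when nothing matches.
t_object = triple.findtext("object/text", default="")
t_other = triple.findtext("adverb/text", default="")

print(confidence, missing, repr(t_object), repr(t_other))
```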

How to get the content of specific grandchild from xml file through python

Hi, I am very new to Python programming. I have an XML file with this structure:
<?xml version="1.0" encoding="UTF-8"?>
<LidcReadMessage xsi:schemaLocation="http://www.nih.gov http://troll.rad.med.umich.edu/lidc/LidcReadMessage.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.nih.gov" uid="1.3.6.1.4.1.14519.5.2.1.6279.6001.1307390687803.0">
<ResponseHeader>
<Version>1.8.1</Version>
<MessageId>-421198203</MessageId>
<DateRequest>2007-11-01</DateRequest>
<TimeRequest>12:30:44</TimeRequest>
<RequestingSite>removed</RequestingSite>
<ServicingSite>removed</ServicingSite>
<TaskDescription>Second unblinded read</TaskDescription>
<CtImageFile>removed</CtImageFile>
<SeriesInstanceUid>1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636438705059720603192</SeriesInstanceUid>
<DateService>2008-08-18</DateService>
<TimeService>02:05:51</TimeService>
<ResponseDescription>1 - Reading complete</ResponseDescription>
<StudyInstanceUID>1.3.6.1.4.1.14519.5.2.1.6279.6001.298806137288633453246975630178</StudyInstanceUID>
</ResponseHeader>
<readingSession>
<annotationVersion>3.12</annotationVersion>
<servicingRadiologistID>540461523</servicingRadiologistID>
<unblindedReadNodule>
<noduleID>Nodule 001</noduleID>
<characteristics>
<subtlety>5</subtlety>
<internalStructure>1</internalStructure>
<calcification>6</calcification>
<sphericity>3</sphericity>
<margin>3</margin>
<lobulation>3</lobulation>
<spiculation>4</spiculation>
<texture>5</texture>
<malignancy>5</malignancy>
</characteristics>
<roi>
<imageZposition>-125.000000 </imageZposition>
<imageSOP_UID>1.3.6.1.4.1.14519.5.2.1.6279.6001.110383487652933113465768208719</imageSOP_UID>
......
There are four <readingSession> elements, each of which contains multiple <unblindedReadNodule> elements. Each of those contains an <roi>. I need to extract the information in <imageSOP_UID> from all of these.
Right now I am doing this:
import xml.etree.ElementTree as ET

tree = ET.parse('069.xml')
root = tree.getroot()

#lst = []
for readingsession in root.iter('readingSession'):
    for roi in readingsession.findall('roi'):
        id = roi.findtext('imageSOP_UID')
        print(id)
but it outputs only this:
Process finished with exit code 0.
I would appreciate any help.

The real problem was the namespace. I had tried with and without it, but the code above still did not work. This version does:
import xml.etree.ElementTree as ET
import pydicom

ds = pydicom.dcmread("000071.dcm")
uid = ds.SOPInstanceUID

tree = ET.parse("069.xml")
root = tree.getroot()
for child in root:
    print(child.tag)
    # every tag has to be qualified with the document's default namespace
    if child.tag == '{http://www.nih.gov}readingSession':
        read = child.find('{http://www.nih.gov}unblindedReadNodule')
        if read is not None:
            nodule_id = read.find('{http://www.nih.gov}noduleID').text
            xml_uid = read.find('{http://www.nih.gov}roi').find('{http://www.nih.gov}imageSOP_UID').text
            if xml_uid == uid:
                print(xml_uid, "=", uid)
                roi = read.find('{http://www.nih.gov}roi')
                print(roi)
This works fine for reading a UID from a DICOM image of the LIDC/IDRI dataset and then finding the same UID in the XML file to get its region of interest.
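Instead of repeating the namespace URI in every tag, ElementTree's find methods also accept a prefix-to-URI mapping. A minimal sketch, assuming a heavily shortened stand-in for the LIDC file (the UID values are made up) and an arbitrary prefix name nih chosen for this example; it collects every imageSOP_UID at any depth:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the LIDC XML; the UID values are made up.
SAMPLE = """
<LidcReadMessage xmlns="http://www.nih.gov">
  <readingSession>
    <unblindedReadNodule>
      <noduleID>Nodule 001</noduleID>
      <roi><imageSOP_UID>1.2.3</imageSOP_UID></roi>
      <roi><imageSOP_UID>1.2.4</imageSOP_UID></roi>
    </unblindedReadNodule>
  </readingSession>
  <readingSession>
    <unblindedReadNodule>
      <noduleID>Nodule 002</noduleID>
      <roi><imageSOP_UID>1.2.5</imageSOP_UID></roi>
    </unblindedReadNodule>
  </readingSession>
</LidcReadMessage>
"""

# The prefix "nih" is an arbitrary local name; only the URI has to match.
NS = {"nih": "http://www.nih.gov"}

root = ET.fromstring(SAMPLE)
uids = [el.text for el in root.findall(".//nih:roi/nih:imageSOP_UID", NS)]
print(uids)
```

The question's loop printed nothing for the same reason: root.iter('readingSession') looks for an unqualified tag, while the parsed tag is '{http://www.nih.gov}readingSession'.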

get value from variable which contains xml

I have the following Jenkins API script:
import jenkins
import json
import re

server = jenkins.Jenkins('https://jenkins_url', username, password)
nodes = json.dumps(server.get_nodes())
nodes = re.sub('\"offline\"|[:{} \[\]]|true,|false,|\"name\"|\"','',nodes).split(',')
for label in nodes:
    if label != 'master':
        print(label)
        node_config = server.get_node_config(label)
        print(node_config)
node_config contains, for example, the following XML text:
<?xml version="1.0" encoding="UTF-8"?>
<slave>
<name>test.server</name>
<description></description>
<remoteFS>/var/lib/jenkins</remoteFS>
<numExecutors>1</numExecutors>
<mode>EXCLUSIVE</mode>
<retentionStrategy class="hudson.slaves.RetentionStrategy$Always"/>
<launcher class="hudson.plugins.sshslaves.SSHLauncher" plugin="ssh-slaves@1.10">
<host>test.server</host>
<port>7777</port>
<credentialsId>d0970a8f-d124</credentialsId>
<maxNumRetries>0</maxNumRetries>
<retryWaitTime>0</retryWaitTime>
</launcher>
<label>BuildServer</label>
<nodeProperties/>
<userId>test</userId>
</slave>
I want to get the value of each tag so that the output shows, for example, test.server and so on.
Could you please help me with this?

xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<slave>
<name>test.server</name>
<description></description>
<remoteFS>/var/lib/jenkins</remoteFS>
<numExecutors>1</numExecutors>
<mode>EXCLUSIVE</mode>
<retentionStrategy class="hudson.slaves.RetentionStrategy$Always"/>
<launcher class="hudson.plugins.sshslaves.SSHLauncher" plugin="ssh-slaves@1.10">
<host>test.server</host>
<port>7777</port>
<credentialsId>d0970a8f-d124</credentialsId>
<maxNumRetries>0</maxNumRetries>
<retryWaitTime>0</retryWaitTime>
</launcher>
<label>BuildServer</label>
<nodeProperties/>
<userId>test</userId>
</slave>
"""
import xml.etree.ElementTree

root = xml.etree.ElementTree.fromstring(xml_text)

# show only a particular tag
for name in root.findall('name'):
    print(name.text)

# show all children at first level
for child in root:
    print('%s: %s' % (child.tag, child.text))

# build a dict (will only get last of any duplicate tags, and no children)
slave = {child.tag: child.text for child in root}

# build a dict (will only get last of any duplicate tags)
def xml_todict(xml_node):
    dict_ = {}
    for child in xml_node:
        dict_[child.tag] = xml_todict(child)
    dict_.update(xml_node.attrib)
    if not dict_:
        return xml_node.text
    if xml_node.text and 'text' not in dict_:
        dict_['text'] = xml_node.text
    return dict_

slave = xml_todict(root)
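To print the value of every tag, including the ones nested under <launcher>, Element.iter() walks the whole tree. A short sketch, assuming a shortened copy of the node configuration from the question:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the node configuration from the question.
xml_text = """
<slave>
  <name>test.server</name>
  <description></description>
  <numExecutors>1</numExecutors>
  <launcher class="hudson.plugins.sshslaves.SSHLauncher">
    <host>test.server</host>
    <port>7777</port>
  </launcher>
  <label>BuildServer</label>
</slave>
"""

root = ET.fromstring(xml_text)

# iter() yields the root and then every descendant in document order.
values = [
    (el.tag, el.text.strip())
    for el in root.iter()
    if el.text and el.text.strip()  # skip empty and whitespace-only elements
]
for tag, text in values:
    print("%s: %s" % (tag, text))
```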

Xpath select attribute of current node?

I use Python with lxml to process XML. After querying and filtering I get to the node I want, but then I have a problem: how do I get its attribute's value with XPath? Here is an example of my input.
>print(etree.tostring(node, pretty_print=True ))
<rdf:li xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" rdf:resource="urn:miriam:obo.chebi:CHEBI%3A37671"/>
The value I want is in resource=... . Currently I just use the lxml API to get the value. I wonder if it is possible to do this in pure XPath? Thanks.
EDIT: I forgot to say that this is not a root node, so I can't use // here. I have some 2000-3000 others in the XML file. My first attempt was playing around with ".@attrib" and "self::*@", but those do not seem to work.
EDIT2: I will try my best to explain (this is my first time dealing with an XML problem using XPath, and English is not one of my favorite fields...). Here is my input snippet: http://pastebin.com/kZmVdbQQ (the full file is available from http://www.comp-sys-bio.org/yeastnet/, version 4).
In my code, I try to get the speciesTypes nodes that have a ChEBI resource link (<rdf:li rdf:resource="urn:miriam:obo.chebi:...."/>), and then I try to get the value of the rdf:resource attribute of rdf:li. I am pretty sure it would be easy to get the attribute of a child node if I started from a parent node like speciesTypes, but I wonder how to do it if I start from rdf:li. From my understanding, "//" in XPath looks for nodes everywhere, not just under the current node.
Below is my code:
import lxml.etree as etree

tree = etree.parse("yeast_4.02.xml")
root = tree.getroot()
ns = {"sbml": "http://www.sbml.org/sbml/level2/version4",
      "rdf":"http://www.w3.org/1999/02/22-rdf-syntax-ns#",
      "body":"http://www.w3.org/1999/xhtml",
      "re": "http://exslt.org/regular-expressions"
      }
#good enough for now
maybemeta = root.xpath("//sbml:speciesType[descendant::rdf:li[starts-with(@rdf:resource, 'urn:miriam:obo.chebi') and not(starts-with(@rdf:resource, 'urn:miriam:uniprot'))]]", namespaces = ns)

def extract_name_and_chebi(node):
    name = node.attrib['name']
    chebies = node.xpath("./sbml:annotation//rdf:li[starts-with(@rdf:resource, 'urn:miriam:obo.chebi') and not(starts-with(@rdf:resource, 'urn:miriam:uniprot'))]", namespaces=ns) #get all rdf:li node with chebi resource
    assert len(chebies) == 1
    #my current solution to get rdf:resource value from rdf:li node
    rdfNS = "{" + ns.get('rdf') + "}"
    chebi = chebies[0].attrib[rdfNS + 'resource']
    #do protein later
    return (name, chebi)

metaWithChebi = map(extract_name_and_chebi, maybemeta)
fo = open("metabolites.txt", "w")
for name, chebi in metaWithChebi:
    fo.write("{0}\t{1}\n".format(name, chebi))

Prefix the attribute name with @ in the XPath query:
>>> from lxml import etree
>>> xml = b"""\
... <?xml version="1.0" encoding="utf8"?>
... <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
... <rdf:li rdf:resource="urn:miriam:obo.chebi:CHEBI%3A37671"/>
... </rdf:RDF>
... """
>>> tree = etree.fromstring(xml)
>>> ns = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}
>>> tree.xpath('//rdf:li/@rdf:resource', namespaces=ns)
['urn:miriam:obo.chebi:CHEBI%3A37671']
EDIT
Here's a revised version of the script in the question:
import lxml.etree as etree

ns = {
    'sbml': 'http://www.sbml.org/sbml/level2/version4',
    'rdf':'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'body':'http://www.w3.org/1999/xhtml',
    're': 'http://exslt.org/regular-expressions',
    }

def extract_name_and_chebi(node):
    chebies = node.xpath("""
        .//rdf:li[
        starts-with(@rdf:resource, 'urn:miriam:obo.chebi')
        ]/@rdf:resource
        """, namespaces=ns)
    return node.attrib['name'], chebies[0]

with open('yeast_4.02.xml') as xml:
    tree = etree.parse(xml)

maybemeta = tree.xpath("""
    //sbml:speciesType[descendant::rdf:li[
    starts-with(@rdf:resource, 'urn:miriam:obo.chebi')]]
    """, namespaces = ns)

with open('metabolites.txt', 'w') as output:
    for node in maybemeta:
        output.write('%s\t%s\n' % extract_name_and_chebi(node))

To select the attribute named rdf:resource of the current node, use this XPath expression:
@rdf:resource
In order for this to "work correctly" you must register the association of the prefix "rdf:" with the corresponding namespace.
If you don't know how to register the rdf namespace, it is still possible to select the attribute with this XPath expression:
@*[name()='rdf:resource']

Well, I got it. The XPath expression I need here is "./@rdf:resource", not ".@rdf:resource". But why? I thought "./" indicated the child of the current node.
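On the "why": as far as XPath 1.0 goes, "." and "@rdf:resource" are two separate location steps (abbreviations of self::node() and attribute::rdf:resource), and steps have to be joined with "/". The "/" is only a step separator; a bare name selects children because child is the default axis, not because of the slash. For comparison, here is a sketch of reading the same attribute without lxml, assuming only the standard library is available. Its XPath subset has no attribute steps, so the value is read with get() and the {uri}name form, much like the question's own workaround. The document and the second resource value are made up for the example:

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ns = {"rdf": RDF_NS}

# Tiny stand-in document; the second resource value is made up.
root = ET.fromstring("""
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:li rdf:resource="urn:miriam:obo.chebi:CHEBI%3A37671"/>
  <rdf:li rdf:resource="urn:miriam:uniprot:P00000"/>
</rdf:RDF>
""")

# Namespaced attributes are keyed as "{namespace-uri}localname".
resource_key = "{%s}resource" % RDF_NS

chebi = [
    li.get(resource_key)
    for li in root.findall(".//rdf:li", ns)
    if li.get(resource_key, "").startswith("urn:miriam:obo.chebi")
]
print(chebi)
```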
