Iterating over XML and selecting specific element tree content - python

I have an XML which looks like this:
<openie>
<triple confidence="1.000">
<subject begin="0" end="1">
<text>PAF</text>
<lemma>paf</lemma>
</subject>
<relation begin="1" end="2">
<text>gets</text>
<lemma>get</lemma>
</relation>
<object begin="2" end="6">
<text>name of web site</text>
<lemma>name of web site</lemma>
</object>
</triple>
<triple confidence="1.000">
<subject begin="0" end="1">
<text>PAF</text>
<lemma>paf</lemma>
</subject>
<relation begin="1" end="2">
<text>gets</text>
<lemma>get</lemma>
</relation>
<object begin="2" end="3">
<text>name</text>
<lemma>name</lemma>
</object>
</triple>
</openie>
The element openie is nested in here: root>document>sentences>sentence>openie
In my function I am trying to print the triples, each of which contains subject, relation and object elements. Unfortunately, I cannot get it to work, because I am unable to reach these three elements and their text child element. Which part is wrong?
def get_openie():
    print('OpenIE parser start...')
    tree = ET.parse('./tmp/nlp_output.xml')
    root = tree.getroot()
    for triple in root.findall('./document/sentences/sentence/openie/triple'):
        t_subject = triple.find('subject/text').text
        t_relation = triple.find('relation/text').text
        t_object = triple.get('object/text').text
        print(t_subject,t_relation,t_object)
Output for two triples should look like this:
PAF gets name of web site
PAF gets name

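To get t_object you are calling triple.get() instead of triple.find(). get() looks up an attribute of the <triple> element; there is no attribute named 'object/text', so it returns None and the following .text raises an AttributeError. Changing the call to find() fixes your problem.
import xml.etree.ElementTree as ET

def get_openie():
    print('OpenIE parser start...')
    tree = ET.parse('./tmp/nlp_output.xml')
    root = tree.getroot()
    for triple in root.findall('./document/sentences/sentence/openie/triple'):
        t_subject = triple.find('subject/text').text
        t_relation = triple.find('relation/text').text
        # find() follows the child path; get() would look up an attribute
        t_object = triple.find('object/text').text
        print(t_subject, t_relation, t_object)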
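To see the difference between the two calls without the original file, here is a minimal sketch. It assumes a trimmed-down copy of the question's XML held in a string; the names SAMPLE and extract_triples are made up for illustration and are not part of the original code.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the question's XML; SAMPLE is an illustrative name.
SAMPLE = """
<openie>
  <triple confidence="1.000">
    <subject begin="0" end="1"><text>PAF</text><lemma>paf</lemma></subject>
    <relation begin="1" end="2"><text>gets</text><lemma>get</lemma></relation>
    <object begin="2" end="3"><text>name</text><lemma>name</lemma></object>
  </triple>
</openie>
"""

def extract_triples(openie):
    """Return (subject, relation, object) text for every <triple> child."""
    triples = []
    for triple in openie.findall('triple'):
        triples.append((
            triple.findtext('subject/text'),
            triple.findtext('relation/text'),
            triple.findtext('object/text'),
        ))
    return triples

openie = ET.fromstring(SAMPLE)
first = openie.find('triple')

# get() looks up an attribute by name, so a path is just an unknown key.
attr_lookup = first.get('object/text')
confidence = first.get('confidence')
# find() walks the child elements along the path.
child_lookup = first.find('object/text').text

triples = extract_triples(openie)
print(attr_lookup, confidence, child_lookup)
print(triples)
```

findtext() is used in the helper because it returns None for a missing child instead of raising, which may or may not be what you want for incomplete triples.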

Related

How to get the content of specific grandchild from xml file through python

Hi, I am very new to Python programming. I have an XML file with this structure:
<?xml version="1.0" encoding="UTF-8"?>
<LidcReadMessage xsi:schemaLocation="http://www.nih.gov http://troll.rad.med.umich.edu/lidc/LidcReadMessage.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://www.nih.gov" uid="1.3.6.1.4.1.14519.5.2.1.6279.6001.1307390687803.0">
<ResponseHeader>
<Version>1.8.1</Version>
<MessageId>-421198203</MessageId>
<DateRequest>2007-11-01</DateRequest>
<TimeRequest>12:30:44</TimeRequest>
<RequestingSite>removed</RequestingSite>
<ServicingSite>removed</ServicingSite>
<TaskDescription>Second unblinded read</TaskDescription>
<CtImageFile>removed</CtImageFile>
<SeriesInstanceUid>1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636438705059720603192</SeriesInstanceUid>
<DateService>2008-08-18</DateService>
<TimeService>02:05:51</TimeService>
<ResponseDescription>1 - Reading complete</ResponseDescription>
<StudyInstanceUID>1.3.6.1.4.1.14519.5.2.1.6279.6001.298806137288633453246975630178</StudyInstanceUID>
</ResponseHeader>
<readingSession>
<annotationVersion>3.12</annotationVersion>
<servicingRadiologistID>540461523</servicingRadiologistID>
<unblindedReadNodule>
<noduleID>Nodule 001</noduleID>
<characteristics>
<subtlety>5</subtlety>
<internalStructure>1</internalStructure>
<calcification>6</calcification>
<sphericity>3</sphericity>
<margin>3</margin>
<lobulation>3</lobulation>
<spiculation>4</spiculation>
<texture>5</texture>
<malignancy>5</malignancy>
</characteristics>
<roi>
<imageZposition>-125.000000 </imageZposition>
<imageSOP_UID>1.3.6.1.4.1.14519.5.2.1.6279.6001.110383487652933113465768208719</imageSOP_UID>
......
There are four <readingSession> elements, each of which contains multiple <roi> elements. Each <roi> contains an <imageSOP_UID>. I need to extract the information in <imageSOP_UID> from all of these.
Right now I am doing this:
import xml.etree.ElementTree as ET
tree = ET.parse('069.xml')
root = tree.getroot()
#lst = []
for readingsession in root.iter('readingSession'):
    for roi in readingsession.findall('roi'):
        id = roi.findtext('imageSOP_UID')
        print(id)
but the only output is:
Process finished with exit code 0.
Can anyone help?
The real problem was the namespace. I tried with and without it, but it didn't work with the code above. This version qualifies every tag with the namespace:
import xml.etree.ElementTree as ET
import pydicom

ds = pydicom.dcmread("000071.dcm")
uid = ds.SOPInstanceUID
tree = ET.parse("069.xml")
root = tree.getroot()
for child in root:
    print(child.tag)
    if child.tag == '{http://www.nih.gov}readingSession':
        read = child.find('{http://www.nih.gov}unblindedReadNodule')
        if read is not None:
            nodule_id = read.find('{http://www.nih.gov}noduleID').text
            xml_uid = read.find('{http://www.nih.gov}roi').find('{http://www.nih.gov}imageSOP_UID').text
            # Compare the UID stored in the XML with the one from the DICOM file
            if xml_uid == uid:
                print(xml_uid, "=", uid)
                roi = read.find('{http://www.nih.gov}roi')
                print(roi)
This works completely fine to get a UID from a DICOM image of the LIDC/IDRI dataset and then extract the matching UID from the XML file for its region of interest.
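For reference, a shorter way to handle the default namespace is to pass a prefix map to findall(). This is only a sketch: SAMPLE is a small stand-in for the LIDC file with made-up UID values, and the prefix lidc is an arbitrary name chosen here. It also shows why the first attempt printed nothing: the unqualified tag matches no element, and <roi> is not a direct child of <readingSession>, so a plain findall('roi') would miss it even with the namespace.

```python
import xml.etree.ElementTree as ET

# Small stand-in for the LIDC file; the UID values are made up.
SAMPLE = """
<LidcReadMessage xmlns="http://www.nih.gov">
  <readingSession>
    <unblindedReadNodule>
      <noduleID>Nodule 001</noduleID>
      <roi><imageSOP_UID>1.2.3</imageSOP_UID></roi>
      <roi><imageSOP_UID>1.2.4</imageSOP_UID></roi>
    </unblindedReadNodule>
  </readingSession>
  <readingSession>
    <unblindedReadNodule>
      <noduleID>Nodule 002</noduleID>
      <roi><imageSOP_UID>1.2.5</imageSOP_UID></roi>
    </unblindedReadNodule>
  </readingSession>
</LidcReadMessage>
"""

# The prefix "lidc" is arbitrary; only the URI has to match the document.
NS = {'lidc': 'http://www.nih.gov'}

root = ET.fromstring(SAMPLE)

# Without the namespace nothing matches, which is why nothing was printed.
unqualified = list(root.iter('readingSession'))

uids = []
for session in root.findall('lidc:readingSession', NS):
    # './/' searches all descendants, since <roi> sits below <unblindedReadNodule>.
    for roi in session.findall('.//lidc:roi', NS):
        uids.append(roi.findtext('lidc:imageSOP_UID', namespaces=NS))

print(unqualified)
print(uids)
```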

How to select a sibling's child node using XPath in Python?

I'm using Python and lxml. My xml file:
<Example>
<Path>
<Some.Node>
// ...
</Some.Node>
<Some.Node>
<Known.Node KnownAttribute="123"/>
<Some.Stuff>
<Nothing.Important>Bla</Nothing.Important>
</Some.Stuff>
<Relevant.Node>
<Property>
<Name>Some</Name>
<Value>True</Value>
</Property>
<Property>
<Name>Known.Name</Name>
<Value>Desired Value</Value>
</Property>
<Property>
<Name>Some.Other</Name>
<Value>Yes</Value>
</Property>
// ...
</Relevant.Node>
// ...
</Some.Node>
<Some.Node>
// ...
</Some.Node>
</Path>
</Example>
There are multiple <Some.Node> nodes and I'm only interested in the one with KnownAttribute equal to 123. This part I got:
query = "//Known.Node[@KnownAttribute='%s']" % attribute_value
However, I need to get the value of <Relevant.Node>/<Property>/<Value> where <Name> has value Known.Name.
This was my best try but it didn't work:
root = etree.parse(xml_file).getroot()
query = "//Known.Node[@KnownAttribute='%s']/..//Property[Name='Known.Name']/Value" % attribute_value
result = root.xpath(query)
print(result[0].text)
It should, of course, print Desired Value, but it just prints an empty value/whitespace.
How can I get the value I need?
You are really close. You can ask for the text of the node directly in the XPath expression.
query = "//Known.Node[@KnownAttribute='%s']/..//" % attribute_value
query += "Property[Name='Known.Name']/Value/text()"
result = root.xpath(query)
print(result[0])
# prints: Desired Value
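The sibling lookup itself (go up to the parent with .., then down to the matching Property) can be tried on a small inline document. The sketch below deliberately uses the standard library's ElementTree rather than lxml so that it runs without third-party packages; ElementTree supports only a subset of XPath and has no text() step, so the text is read from .text instead. SAMPLE and lookup_value are illustrative names, and the first <Some.Node> with attribute 999 is invented to show that only the right sibling group is matched.

```python
import xml.etree.ElementTree as ET

# Reduced version of the question's document; the "999" node is invented.
SAMPLE = """
<Example>
  <Path>
    <Some.Node>
      <Known.Node KnownAttribute="999"/>
      <Relevant.Node>
        <Property><Name>Known.Name</Name><Value>Other Value</Value></Property>
      </Relevant.Node>
    </Some.Node>
    <Some.Node>
      <Known.Node KnownAttribute="123"/>
      <Relevant.Node>
        <Property><Name>Some</Name><Value>True</Value></Property>
        <Property><Name>Known.Name</Name><Value>Desired Value</Value></Property>
      </Relevant.Node>
    </Some.Node>
  </Path>
</Example>
"""

def lookup_value(root, attribute_value, property_name):
    """Return the <Value> texts next to the Known.Node with the given attribute."""
    # 1. find the Known.Node, 2. step up to its parent with '..',
    # 3. search the parent's descendants for the Property with a matching Name.
    query = (
        ".//Known.Node[@KnownAttribute='%s']/..//Property[Name='%s']/Value"
        % (attribute_value, property_name)
    )
    return [value.text for value in root.findall(query)]

root = ET.fromstring(SAMPLE)
result = lookup_value(root, "123", "Known.Name")
print(result)
```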

Python XML AttributeError: 'NoneType' object has no attribute 'text'

I am having a problem figuring out why I receive the error below
AttributeError: 'NoneType' object has no attribute 'text'
I am trying to import an XML file using Python 2.7. Below is what my XML file looks like.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "file.dtd">
<top>
<blue key="2343998978">
<animal>lion</animal>
<animal>seal</animal>
<state>california</state>
<zoo>san diego</zoo>
<year>2015</year>
</blue>
<red key="9383893838739">
<elem_a>jennifer</elem_a>
<elem_a>paul</elem_a>
<elem_a>carl</elem_a>
<elem_b>kansas</elem_b>
<elem_d>australia</elem_d>
</red>
<yellow key="83963277272">
<car>chevy</car>
<car>dodge</car>
<cap>baseball</cap>
<cat>tabby</cat>
</yellow>
<red key="9383893838739">
<elem_a>greg</elem_a>
<elem_a>chris</elem_a>
<elem_a>john</elem_a>
<elem_b>arkansas</elem_b>
<elem_c>ice cream</elem_c>
</red>
<yellow key="84748346734">
<car>toyota</car>
<car>honda</car>
<cap>football</cap>
</yellow>
</top>
I am new to Python but created the script below to import the XML file above and that is when I receive the error above. Below is my code.
import xml.etree.ElementTree as ET
myfile = 'C:/Users/user1/Desktop/file.xml'
tree = ET.parse(myfile)
root = tree.getroot()
for x in root.findall('blue'):
    animal = x.find('animal').text
    key1 = x.attrib['key']
    state = x.find('state').text
    zoo = x.find('zoo').text
    year = x.find('year').text
    print animal, key1, state, zoo, year
for y in root.findall('red'):
    elem_a = y.find('elem_a').text
    key2 = y.attrib['key']
    elem_b = y.find('elem_b').text
    elem_c = y.find('elem_c').text
    elem_d = y.find('elem_d').text
    print elem_a, key2, elem_b, elem_c, elem_d
for z in root.findall('yellow'):
    car = z.find('car').text
    key3 = z.attrib['key']
    cap = z.find('cap').text
    cat = z.find('cat').text
    print car, key3, cap, cat
In the XML file there are three main element types: blue, red and yellow. One of the problems is that specific child elements exist for some parent elements but not for others. For example, in the sample XML above, the first "yellow" element has all three child elements "car", "cap" and "cat", while the second "yellow" element has no "cat" child. In the full XML file a "yellow" element could have any one, two or all three of the "cat", "cap" and "car" child elements. I know this is causing the error, but I do not know how to resolve it. Does anyone have any ideas or tips as to how to resolve this error? Thank you.
You can go through the tree: for x in root: iterates over the root's children blue, red and yellow, and for every color tag you can loop again over its subtree.
x.tag is the tag name of an element.
x.attrib is a map with the attributes of an element.
x.getchildren() is a list of all the child elements of an element (deprecated, and removed in Python 3.9; there, use list(x) or iterate over x directly).
x.text is the text content of an element.
An example:
import xml.etree.ElementTree as ET
my_file = 'C:/Users/user1/Desktop/file.xml'
tree = ET.parse(my_file)
root = tree.getroot()

def print_subtree(subtree):
    for y in subtree:
        print "\t", y.tag, ":", y.text

for x in root:
    print x.tag, x.attrib
    print_subtree(x.getchildren())
This works fine with a two-level tree; for an n-level tree, recursion would be necessary.
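A recursive version might look like the following sketch, written for Python 3. The nested <habitat> element is invented here purely to give the tree a third level, and walk is an illustrative name.

```python
import xml.etree.ElementTree as ET

# The <habitat> wrapper is invented to add a third level to the tree.
SAMPLE = """
<top>
  <blue key="2343998978">
    <animal>lion</animal>
    <habitat><state>california</state><zoo>san diego</zoo></habitat>
  </blue>
</top>
"""

def walk(element, depth=0):
    """Yield (depth, tag, text) for the element and all of its descendants."""
    # element.text is None or whitespace for pure container elements.
    text = (element.text or '').strip()
    yield depth, element.tag, text
    for child in element:
        yield from walk(child, depth + 1)

rows = list(walk(ET.fromstring(SAMPLE)))
for depth, tag, text in rows:
    print('\t' * depth + tag, ':', text)
```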
Some docs: https://docs.python.org/2/library/xml.etree.elementtree.html
Something about recursion: Xml parsing with Python using recursion. Problem with return value
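As for the AttributeError itself: find() returns None when the child does not exist, so .text fails on it. If you want to keep the fixed-column approach from the question, one option is findtext() with a default value. This is a sketch for Python 3 using only the two "yellow" elements from the question; SAMPLE and rows are illustrative names, and the empty-string default is an assumption about what you want to see for a missing child.

```python
import xml.etree.ElementTree as ET

# The two <yellow> elements from the question; the second has no <cat>.
SAMPLE = """
<top>
  <yellow key="83963277272">
    <car>chevy</car>
    <car>dodge</car>
    <cap>baseball</cap>
    <cat>tabby</cat>
  </yellow>
  <yellow key="84748346734">
    <car>toyota</car>
    <car>honda</car>
    <cap>football</cap>
  </yellow>
</top>
"""

root = ET.fromstring(SAMPLE)
rows = []
for z in root.findall('yellow'):
    rows.append((
        # findtext() returns the default instead of failing on a missing child.
        z.findtext('car', default=''),
        z.get('key'),
        z.findtext('cap', default=''),
        z.findtext('cat', default=''),
    ))

for row in rows:
    print(*row)
```

Like find(), findtext() only returns the first matching child, so the second <car> of each element is still ignored; use findall('car') if you need all of them.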

Parsing complex Xml Python 3.4

I have the following XML:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<Suite>
<TestCase>
<TestCaseID>001</TestCaseID>
<TestCaseDescription>Hello</TestCaseDescription>
<TestSetup>
<Action>
<ActionCommand>gfdg</ActionCommand>
<TimeOut>dfgd</TimeOut>
<BamSymbol>gff</BamSymbol>
<Side>vfbgc</Side>
<PrimeBroker>fgfd</PrimeBroker>
<Size>fbcgc</Size>
<PMCode>fdgd</PMCode>
<Strategy>fdgf</Strategy>
<SubStrategy>fgf</SubStrategy>
<ActionLogEndPoint>fdgf</ActionLogEndPoint>
<IsActionResultLogged>fdgf</IsActionResultLogged>
<ValidationStep>
<IsValidated>fgdf</IsValidated>
<ValidationFormat>dfgf</ValidationFormat>
<ResponseEndpoint>gdf</ResponseEndpoint>
<ResponseParameterName>fdgfdg</ResponseParameterName>
<ResponseParameterValue>gff</ResponseParameterValue>
<ExpectedValue>fdgf</ExpectedValue>
<IsValidationResultLogged>gdfgf</IsValidationResultLogged>
<ValidationLogEndpoint>fdgf</ValidationLogEndpoint>
</ValidationStep>
</Action>
<Action>
<ActionCommand>New Order</ActionCommand>
<TimeOut>fdgf</TimeOut>
<BamSymbol>fdg</BamSymbol>
<Side>C(COVER)</Side>
<PrimeBroker>CSPB</PrimeBroker>
<Size>fdgd</Size>
<PMCode>GREE</PMCode>
<Strategy>Generalist</Strategy>
<SubStrategy>USLC</SubStrategy>
<ActionLogEndPoint>gfbhgf</ActionLogEndPoint>
<IsActionResultLogged>fdgf</IsActionResultLogged>
<ValidationStep>
<IsValidated>fdgd</IsValidated>
<ValidationFormat>dfgfd</ValidationFormat>
<ResponseEndpoint>dfgf</ResponseEndpoint>
<ResponseParameterName>fdgfd</ResponseParameterName>
<ResponseParameterValue>dfgf</ResponseParameterValue>
<ExpectedValue>fdg</ExpectedValue>
<IsValidationResultLogged>fdgdf</IsValidationResultLogged>
<ValidationLogEndpoint>fdgfd</ValidationLogEndpoint>
</ValidationStep>
</Action>
</TestSetup>
</TestCase>
</Suite>
Based on the ActionCommand I select one <Action> block. The issue is that I cannot get the nested parent tag (ValidationStep) and all its child tags. Can anyone help?
My code:
for testSetup4 in root.findall(".TestCase/TestSetup/Action"):
    if testSetup4.find('ActionCommand').text == "gfdg":
        for c1 in testSetup4:
            t2.append(c1.tag)
            v2.append(c1.text)
        for k,v in zip(t2, v2):
            test_case[k] = v
I am not able to get ValidationStep (sub parent) and its corresponding tags.
Simply add another loop to iterate through the children of the <ValidationStep> node. Also, you do not need the two other lists, as you can update the dictionary directly during the parsing loop:
import xml.etree.ElementTree as et
dom = et.parse('Input.xml')
root = dom.getroot()
test_case = {}
for testSetup4 in root.findall("./TestCase/TestSetup/Action"):
    if testSetup4.find('ActionCommand').text == "gfdg":
        # direct children of <Action>
        for c1 in testSetup4:
            test_case[c1.tag] = c1.text
        # children of the nested <ValidationStep>
        for vd in testSetup4.findall("./ValidationStep/*"):
            test_case[vd.tag] = vd.text
Alternatively, use the double slash operator to search for all descendants of the <Action> element, that is, its children, grandchildren and so on:
for testSetup4 in root.findall("./TestCase/TestSetup/Action"):
    if testSetup4.find('ActionCommand').text == "gfdg":
        for c1 in testSetup4.findall(".//*"):
            test_case[c1.tag] = c1.text
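Here is a self-contained sketch of the descendant approach on a shortened copy of the XML (most fields are dropped for brevity). action_to_dict is an illustrative helper name. One design choice that differs from the snippets above: container elements such as <ValidationStep> are skipped, so the dictionary holds only leaf values instead of an extra key whose value is just whitespace. Note that a flat dictionary silently overwrites entries if the same tag occurs twice inside one <Action>.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's XML; most fields are omitted.
SAMPLE = """
<Suite>
  <TestCase>
    <TestSetup>
      <Action>
        <ActionCommand>gfdg</ActionCommand>
        <TimeOut>dfgd</TimeOut>
        <ValidationStep>
          <IsValidated>fgdf</IsValidated>
          <ExpectedValue>fdgf</ExpectedValue>
        </ValidationStep>
      </Action>
      <Action>
        <ActionCommand>New Order</ActionCommand>
        <TimeOut>fdgf</TimeOut>
        <ValidationStep>
          <IsValidated>fdgd</IsValidated>
          <ExpectedValue>fdg</ExpectedValue>
        </ValidationStep>
      </Action>
    </TestSetup>
  </TestCase>
</Suite>
"""

def action_to_dict(root, command):
    """Flatten the first <Action> whose ActionCommand equals command."""
    for action in root.findall('./TestCase/TestSetup/Action'):
        if action.findtext('ActionCommand') == command:
            # iter() visits the element and every descendant;
            # len(el) == 0 keeps only leaves and drops containers.
            return {el.tag: el.text for el in action.iter() if len(el) == 0}
    return None

root = ET.fromstring(SAMPLE)
test_case = action_to_dict(root, 'gfdg')
print(test_case)
```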

Generating Xml using python

Kindly have a look at the code below; I am using it to generate XML with Python.
from lxml import etree
# Some dummy text
conn_id = 5
conn_name = "Airtelll"
conn_desc = "Largets TRelecome"
ip = "192.168.1.23"
# Building the XML tree
# Note how attributes and text are added, using the Element methods
# and not by concatenating strings as in your question
root = etree.Element("ispinfo")
child = etree.SubElement(root, 'connection',
                         number = str(conn_id),
                         name = conn_name,
                         desc = conn_desc)
subchild_ip = etree.SubElement(child, 'ip_address')
subchild_ip.text = ip
# and pretty-printing it
print etree.tostring(root, pretty_print=True)
This will produce:
<ispinfo>
<connection desc="Largets TRelecome" number="5" name="Airtelll">
<ip_address>192.168.1.23</ip_address>
</connection>
</ispinfo>
But I want it to be like:
<ispinfo>
<connection desc="Largets TRelecome" number='1' name="Airtelll">
<ip_address>192.168.1.23</ip_address>
</connection>
</ispinfo>
That is, the number attribute should appear in single quotes. Any idea how I can achieve this?
There is no flag in lxml to do this, so you have to resort to manual manipulation.
import re
xml_text = etree.tostring(root, pretty_print=True, encoding="unicode")
print(re.sub(r'number="([0-9]+)"', r"number='\1'", xml_text))
However, why do you want to do this? There is no difference other than cosmetics: an XML parser reads both quoting styles identically.
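To check that the substitution really is cosmetic, the sketch below builds the same document, rewrites the quotes and parses the result again. It uses the standard library's ElementTree instead of lxml so that it runs without extra packages, so the output is not pretty-printed here; the variable names are illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Same structure as in the question, built with the standard library.
root = ET.Element('ispinfo')
child = ET.SubElement(root, 'connection',
                      number='5', name='Airtelll', desc='Largets TRelecome')
ET.SubElement(child, 'ip_address').text = '192.168.1.23'

# Serialize to a str, then swap the quotes around the number attribute only.
xml_text = ET.tostring(root, encoding='unicode')
single_quoted = re.sub(r'number="([0-9]+)"', r"number='\1'", xml_text)

# Parsing the rewritten text yields the same attribute value.
reparsed = ET.fromstring(single_quoted)
print(single_quoted)
print(reparsed.find('connection').get('number'))
```

The regular expression only matches purely numeric values, so other attributes keep their double quotes.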
