Basic Python Parsing XML with xml.etree - Issue - python

I am trying to parse XML and am having a hard time. I don't understand why the results keep printing [<Element 'Results' at 0x105fc6110>].
I am trying to extract Social from my example with the following code:
import xml.etree.ElementTree as ET
root = ET.parse("test.xml")
results = root.findall("Results")
print(results)  # [<Element 'Results' at 0x105fc6110>]
# WHAT IS THIS??
for result in results:
    print(result.find("Social"))  # None
The XML looks like this:
<?xml version="1.0"?>
<List1>
<NextOffset>AAA</NextOffset>
<Results>
<R>
<D>internet.com</D>
<META>
<Social>
<v>http://twitter.com/internet</v>
<v>http://facebook.com/internet</v>
</Social>
<Telephones>
<v>+1-555-555-6767</v>
</Telephones>
</META>
</R>
</Results>
</List1>

findall returns a list of xml.etree.ElementTree.Element objects, which is what your print statement shows. In your case there is only one Results node, so you can use find to get the first (here, the only) match.
Once you have it, call find with the .// syntax, which searches anywhere in the subtree rather than only among the direct children of Results.
Once you have found Social, call findall on the v tag and print the text of each match:
import xml.etree.ElementTree as ET
root = ET.parse("test.xml")
result = root.find("Results")
social = result.find(".//Social")
for r in social.findall("v"):
    print(r.text)
This prints:
http://twitter.com/internet
http://facebook.com/internet
Note that I did not perform any validity checks on the XML file. You should check whether find returns None and handle the error accordingly.
Also note that, even though I'm not very confident with the XML format myself, I learned everything I know about parsing it by following this lxml tutorial.
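
The None checks mentioned above could look roughly like this. It is a minimal sketch rather than the answerer's code: the sample is parsed from an inline string instead of test.xml, and the helper name social_links is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Inline copy of the question's sample so the sketch is self-contained.
XML = """<?xml version="1.0"?>
<List1>
  <NextOffset>AAA</NextOffset>
  <Results>
    <R>
      <D>internet.com</D>
      <META>
        <Social>
          <v>http://twitter.com/internet</v>
          <v>http://facebook.com/internet</v>
        </Social>
        <Telephones>
          <v>+1-555-555-6767</v>
        </Telephones>
      </META>
    </R>
  </Results>
</List1>"""


def social_links(root):
    """Hypothetical helper: text of every <v> under the first Results//Social."""
    result = root.find("Results")
    if result is None:   # the root has no <Results> child
        return []
    social = result.find(".//Social")
    if social is None:   # <Results> exists but contains no <Social>
        return []
    return [v.text for v in social.findall("v")]


links = social_links(ET.fromstring(XML))
print(links)

# A document without <Results> yields an empty list instead of an AttributeError.
empty = social_links(ET.fromstring("<List1/>"))
print(empty)
```

The checks compare against None explicitly instead of writing "if not result", because an Element that has no children has historically evaluated as false even though it was found.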

results = root.findall("Results") is a list of xml.etree.ElementTree.Element objects.
type(results)
# list
type(results[0])
# xml.etree.ElementTree.Element
find and findall only look at direct children unless the path says otherwise (for example, with .//). The iter method iterates through matching descendants at any level.
Option 1
If <Results> could potentially have more than one <Social> element, you could use this:
for result in results:
    for soc in result.iter("Social"):
        for link in soc.iter("v"):
            print(link.text)
That's worst case scenario. If you know there'll be one <Social> per <Results> then it simplifies to:
for soc in root.iter("Social"):
    for link in soc.iter("v"):
        print(link.text)
Both print:
http://twitter.com/internet
http://facebook.com/internet
Option 2
Or use nested list comprehensions and do it with one line of code. Because Python...
socialLinks = [[v.text for v in soc] for soc in root.iter("Social")]
# socialLinks == [['http://twitter.com/internet', 'http://facebook.com/internet']]
socialLinks is a list of lists. The outer list has one entry per <Social> element (only one in this example). Each inner list contains the text of the v elements within that particular <Social> element.
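
To make the difference between the lookup styles concrete, here is a small sketch that runs them side by side. It uses a trimmed inline copy of the question's XML, which is an assumption made only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document, keeping only the path to <Social>.
XML = """<List1>
  <Results>
    <R>
      <META>
        <Social>
          <v>http://twitter.com/internet</v>
          <v>http://facebook.com/internet</v>
        </Social>
      </META>
    </R>
  </Results>
</List1>"""

root = ET.fromstring(XML)
results = root.find("Results")

direct = results.find("Social")          # only direct children are checked
nested = results.find("R/META/Social")   # explicit path through each level
anywhere = results.find(".//Social")     # any depth below <Results>
via_iter = next(results.iter("Social"))  # iter() walks the whole subtree

print(direct)                # None
print(nested is anywhere)    # True
print(via_iter is anywhere)  # True
```

The last three lookups return the same Element object; only the bare tag name fails, because <Social> sits three levels below <Results>.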

Related

Parse xml for text of every specific tag not working

I am trying to gather every element <sequence-number> text into a list. Here is my code
#!/usr/bin/env python
from lxml import etree
response = '''
<rpc-reply xmlns:nc="urn:ietf:params:xml:ns:netconf:base:1.0" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="urn:uuid:0d07cdf5-c8e5-45d9-89d1-92467ffd7fe4">
<data>
<ipv4-acl-and-prefix-list xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-ipv4-acl-cfg">
<accesses>
<access>
<access-list-name>TESTTEST</access-list-name>
<access-list-entries>
<access-list-entry>
<sequence-number>1</sequence-number>
<remark>TEST</remark>
<sequence-str>1</sequence-str>
</access-list-entry>
<access-list-entry>
<sequence-number>10</sequence-number>
<grant>permit</grant>
<source-network>
<source-address>10.10.5.0</source-address>
<source-wild-card-bits>0.0.0.255</source-wild-card-bits>
</source-network>
<next-hop>
<next-hop-type>regular-next-hop</next-hop-type>
<next-hop-1>
<next-hop>10.10.5.2</next-hop>
<vrf-name>SANE</vrf-name>
</next-hop-1>
</next-hop>
<sequence-str>10</sequence-str>
</access-list-entry>
<access-list-entry>
<sequence-number>20</sequence-number>
<grant>permit</grant>
<source-network>
<source-address>10.10.6.0</source-address>
<source-wild-card-bits>0.0.0.255</source-wild-card-bits>
</source-network>
<next-hop>
<next-hop-type>regular-next-hop</next-hop-type>
<next-hop-1>
<next-hop>10.10.6.2</next-hop>
<vrf-name>VRFNAME</vrf-name>
</next-hop-1>
</next-hop>
<sequence-str>20</sequence-str>
</access-list-entry>
</access-list-entries>
</access>
</accesses>
</ipv4-acl-and-prefix-list>
</data>
</rpc-reply>
'''
q = etree.fromstring(response)
print(q.findall('.//sequence-number'))
But I get an empty list as output. I have tried the following statements too, with no luck:
print(q.findall('./sequence-number/'))
print(q.findall('sequence-number/'))
print(q.findall('.//sequence-number/'))
print(q.findall('sequence-number'))
How can I gather this data?
As mentioned in the comments, XML namespaces have to be taken into account. The simplest way to handle them is to use the xpath function instead of findall, with a slightly modified search expression:
print(q.xpath(".//*[local-name()='sequence-number']"))
Here the expression .//*[local-name()='sequence-number'] contains the wildcard * with the predicate [local-name()='sequence-number']. It selects every descendant element whose local name (the tag name with the namespace ignored) equals "sequence-number".
Another approach is to create a namespace map and pass it to the findall function:
ns = {"ns":"http://cisco.com/ns/yang/Cisco-IOS-XR-ipv4-acl-cfg"}
print(q.findall(".//ns:sequence-number", ns))
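
The reason the unqualified search finds nothing can be seen by printing a tag. The following sketch uses the standard library's xml.etree.ElementTree rather than lxml, and a shortened copy of the reply (both are assumptions made to keep it self-contained); the prefix "ns" and the constant name ACL_NS are arbitrary.

```python
import xml.etree.ElementTree as ET

ACL_NS = "http://cisco.com/ns/yang/Cisco-IOS-XR-ipv4-acl-cfg"

# Shortened copy of the reply; only the sequence numbers are kept.
RESPONSE = """<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <data>
    <ipv4-acl-and-prefix-list xmlns="http://cisco.com/ns/yang/Cisco-IOS-XR-ipv4-acl-cfg">
      <accesses><access><access-list-entries>
        <access-list-entry><sequence-number>1</sequence-number></access-list-entry>
        <access-list-entry><sequence-number>10</sequence-number></access-list-entry>
        <access-list-entry><sequence-number>20</sequence-number></access-list-entry>
      </access-list-entries></access></accesses>
    </ipv4-acl-and-prefix-list>
  </data>
</rpc-reply>"""

q = ET.fromstring(RESPONSE)

# The parser stores every tag as "{namespace-uri}local-name".
first = q.find(".//{%s}sequence-number" % ACL_NS)
print(first.tag)

# A bare tag name therefore matches nothing.
unqualified = q.findall(".//sequence-number")

# Either a prefix map or the "{uri}tag" form matches the namespaced tag.
with_prefix = [e.text for e in q.findall(".//ns:sequence-number", {"ns": ACL_NS})]
with_clark = [e.text for e in q.findall(".//{%s}sequence-number" % ACL_NS)]
print(unqualified, with_prefix, with_clark)
```

The prefix in the map does not have to match anything in the document; only the URI is compared.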

can we search multiple pattern using etree findall() in xml?

In my case, I have to find a few elements in the XML file and update their values using the text attribute. For that, I have to search for the XML elements A, B and C. My project uses xml.etree and Python. Currently I am using:
self.get_root.findall('H/A/T')
self.get_root.findall('H/B/T')
self.get_root.findall('H/C/T')
The sample XML file:
<H><A><T>text-i-have-to-update</H></A></T>
<H><B><T>text-i-have-to-update</H></B></T>
<H><C><T>text-i-have-to-update</H></C></T>
As we can notice, only the middle element in the path is different. Is there a way to optimize the code using something like self.get_root.findall(H|(A,B,C)|T)? Any guidance in the right direction will do! Thanks!
I went through a similar question, "XPath to select multiple tags", but it didn't work for my case.
Update: maybe a regular expression inside findall()?
The XML in your question is malformed (the closing tags are in the wrong order); assuming it's properly formed (like below), try this:
import xml.etree.ElementTree as ET
data = """<root>
<H><A><T>text-i-have-to-update</T></A></H>
<H><B><T>text-i-have-to-update</T></B></H>
<H><C><T>text-i-have-to-update</T></C></H>
</root>"""
doc = ET.fromstring(data)
for item in doc.findall('.//H//T'):
    item.text = "modified text"
print(ET.tostring(doc).decode())
Output:
<root>
<H><A><T>modified text</T></A></H>
<H><B><T>modified text</T></B></H>
<H><C><T>modified text</T></C></H>
</root>
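
The path .//H//T updates every T below any H, whatever the middle element is. If only particular middle tags should be touched, one option is to filter on the tag name in Python, since ElementTree's path syntax has no "A or B or C" operator. This is a sketch under that assumption; the extra <D> row and the name wanted are added purely for illustration.

```python
import xml.etree.ElementTree as ET

# Same shape as the answer's sample, plus a hypothetical <D> row that
# should be left alone.
DATA = """<root>
<H><A><T>old</T></A></H>
<H><B><T>old</T></B></H>
<H><C><T>old</T></C></H>
<H><D><T>old</T></D></H>
</root>"""

doc = ET.fromstring(DATA)
wanted = {"A", "B", "C"}  # middle tags whose <T> should be updated

for middle in doc.findall("./H/*"):  # every direct child of every <H>
    if middle.tag in wanted:
        for t in middle.findall("T"):
            t.text = "modified text"

texts = [t.text for t in doc.iter("T")]
print(texts)
```

Only the first three T elements change; the one under <D> keeps its original text.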

How to get values from this XML?

I want to parse xml like this:
<?xml version="1.0" ?>
<matches>
<round_1>
<match_1>
<home_team>team_5</home_team>
<away_team>team_13</away_team>
<home_goals_time>None</home_goals_time>
<away_goals_time>24;37</away_goals_time>
<home_age_average>27.4</home_age_average>
<away_age_average>28.3</away_age_average>
<score>0:2</score>
<ball_possession>46:54</ball_possession>
<shots>8:19</shots>
<shots_on_target>2:6</shots_on_target>
<shots_off_target>5:10</shots_off_target>
<blocked_shots>1:3</blocked_shots>
<corner_kicks>3:4</corner_kicks>
<fouls>10:12</fouls>
<offsides>0:0</offsides>
</match_1>
</round_1>
</matches>
I use the standard library's xml package, but I can't get values from the inner tags. Here is my example code:
import xml.etree.ElementTree as et
TEAMS_STREAM = "data/stats1.xml"
tree = et.parse(TEAMS_STREAM)
root = tree.getroot()
for elem in root.iter('home_goals_time'):
    print(elem.attrib)
It should work but it doesn't. I tried to find an issue in the XML structure but I couldn't find one. I always get an empty dict. Can you tell me what's wrong?
You are calling .attrib on the element, but those elements have no attributes, so you get an empty dict. If you want to print the inner text of the element, use .text instead of .attrib:
for elem in root.iter('home_goals_time'):
    print(elem.text)
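The iteration itself is fine; the problem is only that .attrib was printed instead of .text. Using findall with a .// path, I was also able to get the value inside <home_goals_time>:
for i in root.findall('.//home_goals_time'):
    print(i.text)
This prints None, which here is the literal string "None" stored in the XML, not Python's None object.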
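
The difference between .attrib and .text can be checked with a short sketch. It uses a trimmed inline copy of the sample; the id="m1" attribute on <match_1> is not in the original file and is added only to show what .attrib looks like when an attribute exists.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; id="m1" is a hypothetical attribute added for illustration.
XML = """<matches>
  <round_1>
    <match_1 id="m1">
      <home_team>team_5</home_team>
      <home_goals_time>None</home_goals_time>
      <away_goals_time>24;37</away_goals_time>
    </match_1>
  </round_1>
</matches>"""

root = ET.fromstring(XML)

home = next(root.iter("home_goals_time"))
print(home.attrib)      # {} because the tag carries no attributes
print(repr(home.text))  # 'None', a string taken from the file

match = next(root.iter("match_1"))
print(match.attrib)     # {'id': 'm1'}

# Text values are plain strings, so any further parsing is up to the caller.
away_minutes = [int(m) for m in root.findtext(".//away_goals_time").split(";")]
print(away_minutes)     # [24, 37]
```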

ElementTree XML API not matching subelement

I am attempting to use the USPS API to return the status of package tracking. I have a method that returns an ElementTree.Element object built from the XML string returned from the USPS API.
This is the returned XML string.
<?xml version="1.0" encoding="UTF-8"?>
<TrackResponse>
<TrackInfo ID="EJ958088694US">
<TrackSummary>The Postal Service could not locate the tracking information for your
request. Please verify your tracking number and try again later.</TrackSummary>
</TrackInfo>
</TrackResponse>
I format that into an Element object
response = xml.etree.ElementTree.fromstring(xml_str)
Now I can see in the xml string that the tag 'TrackSummary' exists and I would expect to be able to access that using ElementTree's find method.
As extra proof I can iterate over the response object and prove that the 'TrackSummary' tag exists.
for item in response.iter():
print(item, item.text)
returns:
<Element 'TrackResponse' at 0x00000000041B4B38> None
<Element 'TrackInfo' at 0x00000000041B4AE8> None
<Element 'TrackSummary' at 0x00000000041B4B88> The Postal Service could not locate the tracking information for your request. Please verify your tracking number and try again later.
So here is the problem.
print(response.find('TrackSummary'))
returns
None
Am I missing something here? Seems like I should be able to find that child element without a problem?
import xml.etree.ElementTree as ET  # uses the C accelerator automatically since Python 3.3; cElementTree was removed in 3.9
response = ET.fromstring(xml_str)
From the XPath syntax table in the ElementTree documentation: * selects all child elements. For example, */egg selects all grandchildren named egg.
elements = response.findall('*/TrackSummary')  # returns a list
print(elements[0].text)  # prints the first match; iterate over the list if there may be several
>>> The Postal Service could not locate the tracking information for your request. Please verify your tracking number and try again later.
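The .find() method only searches the direct children of the element, not recursively. To search recursively, use a path that starts with .// (in XPath, the double slash // means "at any depth"). Try this:
# returns a list of elements with tag TrackSummary
response.findall('.//TrackSummary')
# returns a list of the text contained in each TrackSummary tag
[el.text for el in response.findall('.//TrackSummary')]
Note that xml.etree.ElementTree elements have no .xpath() method. If you parse with lxml instead, response.xpath('//TrackSummary') and response.xpath('//TrackSummary/text()') give the same two results.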
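
A short sketch of the lookups that do and do not reach <TrackSummary> from the root, using the standard library only. The XML is an inline copy of the response with the summary text shortened, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Inline copy of the USPS response; the summary text is shortened.
XML_STR = """<TrackResponse>
  <TrackInfo ID="EJ958088694US">
    <TrackSummary>The Postal Service could not locate the tracking information.</TrackSummary>
  </TrackInfo>
</TrackResponse>"""

response = ET.fromstring(XML_STR)

direct = response.find("TrackSummary")             # None: not a direct child
by_path = response.find("TrackInfo/TrackSummary")  # explicit parent/child path
anywhere = response.find(".//TrackSummary")        # any depth below the root

# Map each tracking ID attribute to its summary text.
summaries = {
    info.get("ID"): info.findtext("TrackSummary")
    for info in response.findall("TrackInfo")
}
print(direct, by_path is anywhere)
print(summaries)
```

The explicit path is the stricter choice when the layout is known, since .// would also match a TrackSummary nested somewhere unexpected.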

XML parsing in Python using Python 2 or 3

I'm just trying to write a simple program to allow me to parse some of the following XML.
So far, following the examples I have found, I am not getting the results I'm looking for.
I encounter many of these XML files and I generally want the info inside a handful of tags.
What's the best way, using ElementTree, to search for <Id> and grab whatever info is in that tag? I was trying things like
for Reel in root.findall('Reel'):
    id = Reel.findtext('Id')
    print(id)
Is there a way to just look for every instance of <Id> and grab the urn: etc. that comes after it? Some code that traverses everything and looks for <what I want> and so on.
This is a very truncated version of what I usually deal with.
This didn't get what I wanted at all. Is there an easy way just to match <what I want> in any XML file and get the contents of that tag, or do I need to know the structure of the XML well enough to know its relation to root/child etc.?
<Reel>
<Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
<AssetList>
<MainPicture>
<Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
<FrameRate>24 1</FrameRate>
<ScreenAspectRatio>2048 858</ScreenAspectRatio>
</MainPicture>
<MainSound>
<Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
</MainSound>
</AssetList>
</Reel>
@Mata that worked perfectly, but when I tried to use it for different values in another XML file I fell flat on my face. For instance, what about this section of a file? I couldn't post the whole thing, unfortunately. What if I want to grab what comes after KeyId?
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<!-- Generated by Wailua Version 0.3.20 -->
<AuthenticatedPublic Id="ID_AuthenticatedPublic">
<MessageId>urn:uuid:7bc63f4c-c617-4d00-9e51-0c8cd6a4f59e</MessageId>
<MessageType>http://www.digicine.com/PROTO-ASDCP-KDM-20040311#</MessageType>
<AnnotationText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE ~ KDM for Quvis-10010.pem</AnnotationText>
<IssueDate>2007-04-29T04:13:43-00:00</IssueDate>
<Signer>
<dsig:X509IssuerName>dnQualifier=BzC0n/VV/uVrl2PL3uggPJ9va7Q=,CN=.deluxe-admin-c,OU=.mxf-j2c.ca.cinecert.com,O=.ca.cinecert.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>10039</dsig:X509SerialNumber>
</Signer>
<RequiredExtensions>
<Recipient>
<X509IssuerSerial>
<dsig:X509IssuerName>dnQualifier=RUxyQle0qS7qPbcNRFBEgVjw0Og=,CN=SM.QuVIS.com.001,OU=QuVIS Digital Cinema,O=QuVIS.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>363</dsig:X509SerialNumber>
</X509IssuerSerial>
<X509SubjectName>CN=SM MD LE FM.QuVIS_CinemaPlayer-3d_10010,OU=QuVIS,O=QuVIS.com,dnQualifier=3oBfjTfx1me0p1ms7XOX\+eqUUtE=</X509SubjectName>
</Recipient>
<CompositionPlaylistId>urn:uuid:336263da-e4f1-324e-8e0c-ebea00ff79f4</CompositionPlaylistId>
<ContentTitleText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE</ContentTitleText>
<ContentKeysNotValidBefore>2007-04-30T05:00:00-00:00</ContentKeysNotValidBefore>
<ContentKeysNotValidAfter>2007-04-30T10:00:00-00:00</ContentKeysNotValidAfter>
<KeyIdList>
<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
<KeyId>urn:uuid:5d9b228d-7120-344c-aefc-840cdd32bbfc</KeyId>
<KeyId>urn:uuid:1e32ccb2-ab0b-9d43-b879-1c12840c178b</KeyId>
<KeyId>urn:uuid:44d04416-676a-2e4f-8995-165de8cab78d</KeyId>
<KeyId>urn:uuid:906da0c1-b0cb-4541-b8a9-86476583cdc4</KeyId>
<KeyId>urn:uuid:0fe2d73a-ebe3-9844-b3de-4517c63c4b90</KeyId>
<KeyId>urn:uuid:862fa79a-18c7-9245-a172-486541bef0c0</KeyId>
<KeyId>urn:uuid:aa2f1a88-7a55-894d-bc19-42afca589766</KeyId>
<KeyId>urn:uuid:59d6eeff-cd56-6245-9f13-951554466626</KeyId>
<KeyId>urn:uuid:14a13b1a-76ba-764c-97d0-9900f58af53e</KeyId>
<KeyId>urn:uuid:ccdbe0ae-1c3f-224c-b450-947f43bbd640</KeyId>
<KeyId>urn:uuid:dcd37f10-b042-8e44-bef0-89bda2174842</KeyId>
<KeyId>urn:uuid:9dd7103e-7e5a-a840-a15f-f7d7fe699203</KeyId>
</KeyIdList>
</RequiredExtensions>
<NonCriticalExtensions/>
</AuthenticatedPublic>
<AuthenticatedPrivate Id="ID_AuthenticatedPrivate"><enc:EncryptedKey xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p">
<ds:DigestMethod xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
</enc:EncryptionMethod>
The expression Reel.findtext('Id') only matches direct children of Reel. If you want to find all Id tags below Reel, at any depth, you can use:
ids = [id.text for id in Reel.findall(".//Id")]
This gives you a list with the text of every Id tag that is a descendant of Reel.
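
As a quick check of the difference, here is a sketch that treats a trimmed copy of the <Reel> sample as the whole document. That is an assumption, since the real file is described as truncated and <Reel> may be nested deeper there.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample; only the <Id> elements are kept.
XML = """<Reel>
  <Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
  <AssetList>
    <MainPicture>
      <Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
    </MainPicture>
    <MainSound>
      <Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
    </MainSound>
  </AssetList>
</Reel>"""

reel = ET.fromstring(XML)

own_id = reel.findtext("Id")                       # direct child only
all_ids = [e.text for e in reel.findall(".//Id")]  # every level, document order

# If <Reel> is itself the root, searching the root for 'Reel' finds nothing,
# because find/findall never match the element they are called on.
self_search = reel.findall("Reel")

print(own_id)
print(all_ids)
print(self_search)
```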
edit:
Your updated example uses namespaces. In this case KeyId is in the default namespace (http://www.digicine.com/PROTO-ASDCP-KDM-20040311#), so you need to include that namespace in your search:
from xml.etree import ElementTree
doc = ElementTree.parse('test.xml')
nsmap = {'ns': 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'}
ids = [id.text for id in doc.findall(".//ns:KeyId", namespaces=nsmap)]
print(ids)
...
The XPath subset that ElementTree supports is rather limited. If you want more complete support, you should use lxml instead; its XPath support is far more complete.
For example, using xpath to search for all KeyId tags (ignoring namespaces) and returning their text content directly:
from lxml import etree
doc = etree.parse('test.xml')
ids = doc.xpath(".//*[local-name()='KeyId']/text()")
print(ids)
...
It sounds like XPath might be right up your alley - it will let you query your XML document for exactly what you're looking for, as long as you know the structure.
Here's what I needed to do. This works for finding whatever I need.
for node in tree.iter():  # getiterator() was removed in Python 3.9
    if 'KeyId' in node.tag:
        print(node.text)
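
One caveat with a substring test such as 'KeyId' in node.tag is that it also matches <KeyIdList>. A possible refinement, sketched here with the standard library only, is to compare the local part of the tag exactly. The helper names local_name and texts_by_local_name are made up for this example, and the XML is a cut-down copy of the KDM sample.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the KDM sample with two of the KeyId entries.
XML = """<DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#">
  <AuthenticatedPublic Id="ID_AuthenticatedPublic">
    <RequiredExtensions>
      <KeyIdList>
        <KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
        <KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
      </KeyIdList>
    </RequiredExtensions>
  </AuthenticatedPublic>
</DCinemaSecurityMessage>"""


def local_name(tag):
    """Hypothetical helper: drop a leading "{namespace}" from a tag."""
    return tag.rpartition("}")[2]


def texts_by_local_name(root, name):
    """Hypothetical helper: text of every element whose local name is `name`."""
    return [
        el.text
        for el in root.iter()
        # Tags are strings for ordinary elements; the check skips anything else.
        if isinstance(el.tag, str) and local_name(el.tag) == name
    ]


root = ET.fromstring(XML)
key_ids = texts_by_local_name(root, "KeyId")
substring_hits = [local_name(el.tag) for el in root.iter() if "KeyId" in el.tag]

print(key_ids)
print(substring_hits)  # also contains 'KeyIdList'
```

This works without knowing the namespace URI or the document structure in advance, at the cost of walking the whole tree.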
