Python comparing XML output to a list - python

I have an XML that looks something like this:
<Import>
<spId>1234</spId>
<GroupFlag>false</GroupFlag>
</Import>
I want to extract the value of spId and compare it with a list and I have the following script:
import xml.etree.ElementTree as ET
xml_file = "c:/somefile.xml"
sp_id_list = ['1234']
tree = ET.parse(xml_file)
root = tree.getroot()
for sp_id in root.findall('./spId'):
    if sp_id.text in sp_id_list:
        print(sp_id.text)
This doesn't work for spId (numeric) but works for comparing GroupFlag (string) with a list. Why is this happening and how can I rectify this problem?
Sorry for the stupid question, I am a noob to this.

Your code works correctly when the XML sample you posted is used as the input file.
However, since you call findall(), I assume your real document has many <Import> items. A list of items that is not wrapped in a single parent tag is not well-formed XML, and parsing it would raise xml.etree.ElementTree.ParseError.
So I assume that in your real document <Import> is not the root element and the <Import> elements sit somewhere deeper in the document, for example:
<Parent>
<Import>
<spId>1234</spId>
<GroupFlag>false</GroupFlag>
</Import>
<Import>
<spId>1234</spId>
<GroupFlag>false</GroupFlag>
</Import>
</Parent>
In that case the search pattern './spId' cannot find those tags, because it matches only direct children of the root element. Instead, use a path that matches tags at every level beneath the root or, even better, one that gives the direct path from the root to the level where spId is located:
# all subelements, on all levels beneath the current element
root.findall('.//spId')
# all spId elements directly in Import tags that are directly
# beneath the root element (as in the above XML example)
root.findall('./Import/spId')
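To see the difference between the three patterns side by side, here is a minimal sketch. The inline XML is a hypothetical stand-in for the real file, and the second spId value is made up so that the list comparison has something to reject:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the <Parent> document above.
XML = """
<Parent>
  <Import><spId>1234</spId><GroupFlag>false</GroupFlag></Import>
  <Import><spId>5678</spId><GroupFlag>false</GroupFlag></Import>
</Parent>
"""

root = ET.fromstring(XML)
sp_id_list = ['1234']

direct = root.findall('./spId')          # direct children of <Parent> only
anywhere = root.findall('.//spId')       # any depth below <Parent>
by_path = root.findall('./Import/spId')  # explicit path from the root

# Element text is always a string, so compare against strings.
matches = [e.text for e in by_path if e.text in sp_id_list]
print(len(direct), len(anywhere), len(by_path), matches)
```

Since .text is always a string, the "numeric" spId compares fine against a list of strings; an empty result is usually a sign that the path matched nothing.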

Related

Xpath that returns the whole document in python

I am trying to debug some inherited code. So there is a line of code
elt = doc.xpath('body/div/pre[@id="bb flags"]')[0]
I want to see what the entire document looks like at that point, rather than just a specific piece. So what xpath should I insert there?
i.e.
elt_entire_document = doc.xpath(new xpath)
logging.info("The full document here is " + elt_entire_document.text)
Is this even possible with xpath, or is it more complicated than that?
Simply like this, based on our comments:
logging.info("The full document here is " + text)
Your question title seems to be asking about selecting a whole document, yet your question body seems to be asking about displaying a selected node...
Selecting the whole document via XPath
Selecting the whole document might mean any of the following:
/ selects the root node of XML document.
/* selects the document element (aka the root element) of the XML document. (Its parent is the root node.)
string(/) selects the string-value of the XML document.
See also:
What is the difference between root node, root element and document element in XML?
How to pretty print XML from the command line?
Python pretty print subtree
How to use lxml and python to pretty print a subtree of an xml file?
how to get the full contents of a node using xpath & lxml?
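For logging purposes, note that .text holds only the text sitting directly inside an element before its first child, so it often prints little or nothing for a document element. A minimal sketch of two alternatives follows. It assumes the standard library's ElementTree rather than lxml (lxml has its own etree.tostring() in a similar role), and the sample markup is hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the parsed document in the question.
root = ET.fromstring(
    '<html><body><div><pre id="bb flags">-O2 -g</pre></div></body></html>'
)

# Serialise the whole tree, markup included, for logging.
full_markup = ET.tostring(root, encoding='unicode')

# Concatenate every text node; roughly what string(/) evaluates to.
full_text = ''.join(root.itertext())

print(full_markup)
print(full_text)
```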

xml.etree.ElementTree not finding all Elements in XML

I have the following XML file that I'm trying to iterate through using xml.etree:
<safetypadapiresponse><url></url><refcode /><status>SUCCESS</status><message><pcrs>
<pcr>
<eCase01m>1234</eCase01m>
<eProcedures03>12 Lead ECG Obtained</eProcedures03>
<eMedications03>Oxygen</eMedications03>
</pcr>
</pcrs></message></safetypadapiresponse>
I'm unable to find any of the child elements after 'message' with the following:
import xml.etree.ElementTree as ET
tree = ET.parse(xmlFile)
root = tree.getroot()
for member in root.findall('pcr'):
    print(member)
The following child elements are listed when the following is run:
for member in root:
    print(member)
Element 'url'
Element 'refcode'
Element 'status'
Element 'message'
I'm trying to retrieve all the information under the pcr element (i.e. eCase01m, eProcedures03, eMedications03).
You can use findall() in two ways. Unhelpfully this is mentioned in two different parts of the docs:
Element.findall() finds only elements with a tag which are direct
children of the current element.
...
Finds all matching subelements, by tag name or path. Returns a list
containing all matching elements in document order.
What this means is that if you pass a bare tag name, only the direct children of the current element are searched.
You can pass a path instead, which searches deeper in the document for matches. Either of the following should work:
root.findall('./message/pcrs/pcr') # Find them relative to this node
root.findall('.//pcr') # Find them anywhere below the current node
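For the sake of completeness, let me add that you can also try full XPath. Note that the xpath() method exists only in lxml, not in xml.etree.ElementTree, so this assumes tree was parsed with lxml.etree:
for i in tree.xpath('*//pcr/*'):
    print(i.tag)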
Output:
eCase01m
eProcedures03
eMedications03

access elements and attribs DIRECTLY using lxml etree

Given the following xml structure:
<root>
<a>
<from name="abc">
<b>xxx</b>
<c>yyy</c>
</from>
<to name="def">
<b>blah blah</b>
<c>another blah blah</c>
</to>
</a>
</root>
How can I access directly the value of "from.b" of each "a" without loading first "from" (with find()) of each "a"?
As you can see there are exactly the same elements under "from" and "to". So the method findall() would not work as I have to differentiate where the value of "b" is coming from.
I would like to get the method of direct access because if I have to load each child element (there is a lot) my code would be quite verbose. And in addition in my case performance counts and I have a lot of XML docs to parse! So I have to find the fastest method to go through the document (and store the data into a DB)
Within each "a" element there is exactly 1 "from" element and within each "from" element there is exactly 1 "b" element.
I can do this without trouble using lxml objectify, but I want to use etree: I already have to parse the document with etree in order to validate it against an XSD schema, and I do not want to parse the whole document a second time.
find (and findall) lets you specify a path to elements as well, for example you can do:
root = ET.fromstring(input_xml)
for a in root.findall('a'):
    print(a, a.find('from/b').text)
This assumes you always have exactly one from and one b element.
Otherwise, if the code needs to be more robust, I would be tempted to use findall() and do the checks in Python.
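A minimal sketch of that more defensive variant, using the standard library's ElementTree (lxml's etree accepts the same find/findall paths). The second <a> element, which has no <from>, is a hypothetical malformed record added to show the check firing:

```python
import xml.etree.ElementTree as ET

XML = """
<root>
  <a>
    <from name="abc"><b>xxx</b><c>yyy</c></from>
    <to name="def"><b>blah blah</b><c>another blah blah</c></to>
  </a>
  <a>
    <to name="ghi"><b>no from here</b></to>
  </a>
</root>
"""

root = ET.fromstring(XML)

from_b_values = []
for a in root.findall('a'):
    matches = a.findall('from/b')  # only <b> under <from>, never under <to>
    if len(matches) != 1:
        # Skip (or log) elements that break the one-from/one-b assumption.
        continue
    from_b_values.append(matches[0].text)

print(from_b_values)
```

Calling a.find('from/b').text directly would raise AttributeError on the second element, because find() returns None when nothing matches.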

XML parsing in Python using Python 2 or 3

I'm just trying to write a simple program to allow me to parse some of the following XML.
So far in following examples I am not getting the results I'm looking for.
I encounter many of these XML files and I generally want the info after a handful of tags.
What's the best way, using ElementTree, to search for <Id> and grab whatever info is in that tag? I was trying things like
for Reel in root.findall('Reel'):
    id = Reel.findtext('Id')
    print(id)
Is there a way just to look for every instance of <Id> and grab the urn: etc that comes after it? Some code that traverses everything and looks for <what I want> and so on.
This is a very truncated version of what I usually deal with.
This didn't get what I wanted at all. Is there an easy way just to match <what I want> in any XML file and get the contents of that tag, or do I need to know the structure of the XML well enough to know its relation to root/child etc.?
<Reel>
<Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
<AssetList>
<MainPicture>
<Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
<FrameRate>24 1</FrameRate>
<ScreenAspectRatio>2048 858</ScreenAspectRatio>
</MainPicture>
<MainSound>
<Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
</MainSound>
</AssetList>
</Reel>
@Mata that worked perfectly, but when I tried to use that for different values on another XML file I fell flat on my face. For instance, what about this section of a file? I couldn't post the whole thing, unfortunately. What if I want to grab what comes after KeyId?
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<!-- Generated by Wailua Version 0.3.20 -->
<AuthenticatedPublic Id="ID_AuthenticatedPublic">
<MessageId>urn:uuid:7bc63f4c-c617-4d00-9e51-0c8cd6a4f59e</MessageId>
<MessageType>http://www.digicine.com/PROTO-ASDCP-KDM-20040311#</MessageType>
<AnnotationText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE ~ KDM for Quvis-10010.pem</AnnotationText>
<IssueDate>2007-04-29T04:13:43-00:00</IssueDate>
<Signer>
<dsig:X509IssuerName>dnQualifier=BzC0n/VV/uVrl2PL3uggPJ9va7Q=,CN=.deluxe-admin-c,OU=.mxf-j2c.ca.cinecert.com,O=.ca.cinecert.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>10039</dsig:X509SerialNumber>
</Signer>
<RequiredExtensions>
<Recipient>
<X509IssuerSerial>
<dsig:X509IssuerName>dnQualifier=RUxyQle0qS7qPbcNRFBEgVjw0Og=,CN=SM.QuVIS.com.001,OU=QuVIS Digital Cinema,O=QuVIS.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>363</dsig:X509SerialNumber>
</X509IssuerSerial>
<X509SubjectName>CN=SM MD LE FM.QuVIS_CinemaPlayer-3d_10010,OU=QuVIS,O=QuVIS.com,dnQualifier=3oBfjTfx1me0p1ms7XOX\+eqUUtE=</X509SubjectName>
</Recipient>
<CompositionPlaylistId>urn:uuid:336263da-e4f1-324e-8e0c-ebea00ff79f4</CompositionPlaylistId>
<ContentTitleText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE</ContentTitleText>
<ContentKeysNotValidBefore>2007-04-30T05:00:00-00:00</ContentKeysNotValidBefore>
<ContentKeysNotValidAfter>2007-04-30T10:00:00-00:00</ContentKeysNotValidAfter>
<KeyIdList>
<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
<KeyId>urn:uuid:5d9b228d-7120-344c-aefc-840cdd32bbfc</KeyId>
<KeyId>urn:uuid:1e32ccb2-ab0b-9d43-b879-1c12840c178b</KeyId>
<KeyId>urn:uuid:44d04416-676a-2e4f-8995-165de8cab78d</KeyId>
<KeyId>urn:uuid:906da0c1-b0cb-4541-b8a9-86476583cdc4</KeyId>
<KeyId>urn:uuid:0fe2d73a-ebe3-9844-b3de-4517c63c4b90</KeyId>
<KeyId>urn:uuid:862fa79a-18c7-9245-a172-486541bef0c0</KeyId>
<KeyId>urn:uuid:aa2f1a88-7a55-894d-bc19-42afca589766</KeyId>
<KeyId>urn:uuid:59d6eeff-cd56-6245-9f13-951554466626</KeyId>
<KeyId>urn:uuid:14a13b1a-76ba-764c-97d0-9900f58af53e</KeyId>
<KeyId>urn:uuid:ccdbe0ae-1c3f-224c-b450-947f43bbd640</KeyId>
<KeyId>urn:uuid:dcd37f10-b042-8e44-bef0-89bda2174842</KeyId>
<KeyId>urn:uuid:9dd7103e-7e5a-a840-a15f-f7d7fe699203</KeyId>
</KeyIdList>
</RequiredExtensions>
<NonCriticalExtensions/>
</AuthenticatedPublic>
<AuthenticatedPrivate Id="ID_AuthenticatedPrivate"><enc:EncryptedKey xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p">
<ds:DigestMethod xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
</enc:EncryptionMethod>
The expression Reel.findtext('Id') only matches direct children of Reel. If you want to find all Id tags in your xml document, you can just use:
ids = [id.text for id in Reel.findall(".//Id")]
This would give you a list of the text content of every Id tag that is a descendant of Reel, at any depth.
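Here is a small sketch of the difference, using a trimmed copy of the Reel sample from the question (only the Id elements are kept, which is my simplification):

```python
import xml.etree.ElementTree as ET

XML = """
<Reel>
  <Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
  <AssetList>
    <MainPicture><Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id></MainPicture>
    <MainSound><Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id></MainSound>
  </AssetList>
</Reel>
"""

reel = ET.fromstring(XML)

direct_id = reel.findtext('Id')                    # first direct child only
all_ids = [e.text for e in reel.findall('.//Id')]  # every descendant <Id>
also_all = [e.text for e in reel.iter('Id')]       # iter() walks the whole subtree

print(direct_id)
print(all_ids)
```

Both findall('.//Id') and iter('Id') return the elements in document order, so you do not need to know how deeply a tag is nested.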
edit:
Your updated example uses namespaces, in this case KeyId is in the default namespace (http://www.digicine.com/PROTO-ASDCP-KDM-20040311#), so to search for it you need to include it in your search:
from xml.etree import ElementTree
doc = ElementTree.parse('test.xml')
nsmap = {'ns': 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'}
ids = [id.text for id in doc.findall(".//ns:KeyId", namespaces=nsmap)]
print(ids)
...
The XPath subset that ElementTree supports is rather limited. If you want more complete support, you should use lxml instead; its XPath support is far more complete.
For example, using xpath to search for all KeyId tags (ignoring namespaces) and returning their text content directly:
from lxml import etree
doc = etree.parse('test.xml')
ids = doc.xpath(".//*[local-name()='KeyId']/text()")
print(ids)
...
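If installing lxml is not an option, the standard library can get close to the local-name() trick. This is a sketch under two assumptions: the XML is a cut-down, hypothetical version of the KDM file, and the {*} wildcard needs Python 3.8 or newer:

```python
import xml.etree.ElementTree as ET

NS = 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'

# Cut-down, hypothetical version of the KDM document from the question.
XML = f"""
<DCinemaSecurityMessage xmlns="{NS}">
  <KeyIdList>
    <KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
    <KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
  </KeyIdList>
</DCinemaSecurityMessage>
"""

root = ET.fromstring(XML)

# Without the namespace nothing matches: the real tag is '{NS}KeyId'.
unqualified = root.findall('.//KeyId')

# Fully qualified tag name in Clark notation.
qualified = [e.text for e in root.findall('.//{%s}KeyId' % NS)]

# '{*}' matches the tag in any namespace (Python 3.8+ only).
wildcard = [e.text for e in root.findall('.//{*}KeyId')]

print(unqualified, qualified, wildcard, sep='\n')
```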
It sounds like XPath might be right up your alley - it will let you query your XML document for exactly what you're looking for, as long as you know the structure.
Here's what I needed to do. This works for finding whatever I need.
for node in tree.iter():  # getiterator() was removed in Python 3.9
    if 'KeyId' in node.tag:
        mylist = node.tag
        print(mylist)

Quickest/Best way to traverse XML with lxml in Python

I have an XML file that looks like this:
xml = '''<?xml version="1.0"?>
<root>
<item>text</item>
<item2>more text</item2>
<targetroot>
<targetcontainer>
<target>text i want to get</target>
</targetcontainer>
<targetcontainer>
<target>text i want to get</target>
</targetcontainer>
</targetroot>
...more items
</root>
'''
With lxml I'm trying to access the text in the <target> element. I've found a solution, but I'm sure there is a better, more efficient way to do this. My solution:
target = etree.XML(xml)
for x in target.getiterator('root'):
    item1 = x.findtext('item')
    for target in x.iterchildren('targetroot'):
        for t in target.iterchildren('targetcontainer'):
            targetText = t.findtext('target')
Although this works, since it gives me access to all the elements in root as well as the target element, I'm having a hard time believing this is the most efficient solution.
So my question is this: is there a more efficient way to access the texts of the <target> elements while staying in the loop over root? I also need access to the other elements.
You can use XPath:
for x in target.xpath('/root/targetroot/targetcontainer/target'):
    print(x.text)
We ask for all elements that match a path. In this case the path is /root/targetroot/targetcontainer/target, which means all the <target> elements that are inside a <targetcontainer> element, inside a <targetroot> element, inside a <root> element.
The <root> element must also be the document root, because the path starts with /, which denotes the beginning of the document.
Also, your XML document had two problems. First, the <?xml version="1.0"?> declaration should be the very first thing in the document - and in this example it is preceded by a newline and some space. Also, it is not a tag and should not be closed, so the </xml> at the end of your string should be removed. I already edited your question anyway.
EDIT: this solution can still be improved. You do not need to spell out the whole path; you can simply ask for every <target> element in the document by preceding the tag name with two slashes. Since you want all the <target> texts, regardless of where they are, this may be a better solution. The loop above can then be written as:
for x in target.xpath('//target'):
    print(x.text)
I tried this at first but it did not work. The cause, however, was the syntax errors in the XML, not the XPath; I then tried the longer path and forgot to retry this one. Sorry! Anyway, I hope I shed some light on XPath nonetheless :)
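For the part of the question about keeping access to the other elements, a sketch along these lines may help. It is an assumption on my part that the standard library's ElementTree is acceptable here; lxml elements accept the same relative paths. The paths are relative to the root element rather than absolute:

```python
import xml.etree.ElementTree as ET

XML = """<?xml version="1.0"?>
<root>
  <item>text</item>
  <item2>more text</item2>
  <targetroot>
    <targetcontainer><target>text i want to get</target></targetcontainer>
    <targetcontainer><target>text i want to get</target></targetcontainer>
  </targetroot>
</root>
"""

root = ET.fromstring(XML)

# Sibling data and the nested targets are both reachable from root,
# so no nested loops over the intermediate containers are needed.
item1 = root.findtext('item')
target_texts = [t.text for t in root.iterfind('targetroot/targetcontainer/target')]

print(item1)
print(target_texts)
```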
