Parsing and extracting field values from XML files - python

I have a couple of GB of log files in XML format. I am interested in finding the rows that contain a specific command and extracting the user who issued that command.
In other words, I want to check a particular field in every row for a specific value and then extract an unknown value from another field in the same row. How do I go about doing that? (I've tried the documentation and other sources without understanding how it works.)
I currently use Python 2.7.15, but if 3.* is better or easier in any way I'll use that.
Here's an example of a row in a logfile that I'm interested in:
<?xml version="1.0" encoding="UTF-8"?>
<IHEYr4>
<UserAuthenticated>
<LocalUsername>User1</LocalUsername>
<Action>Login</Action>
</UserAuthenticated>
<Host>192.168.1.15</Host>
<TimeStamp>2018-01-18T02:31:00</TimeStamp>
</IHEYr4>

Using ElementTree
Demo:
x = """<?xml version="1.0" encoding="UTF-8"?>
<IHEYr4>
<UserAuthenticated>
<LocalUsername>User1</LocalUsername>
<Action>Login</Action>
</UserAuthenticated>
<Host>192.168.1.15</Host>
<TimeStamp>2018-01-18T02:31:00</TimeStamp>
</IHEYr4>
"""
import xml.etree.ElementTree as ET
xmlVal = ET.fromstring(x)
if xmlVal.find("UserAuthenticated/Action").text == 'Login':
    print(xmlVal.find("Host").text, xmlVal.find("TimeStamp").text)
Output (Python 2.7 prints a tuple; Python 3 prints the two values separated by a space):
('192.168.1.15', '2018-01-18T02:31:00')
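For files of several GB, loading everything with fromstring may not be practical. The following is a minimal Python 3 sketch using ET.iterparse to stream the records and pull out LocalUsername for a given Action. It assumes each log file is one well-formed document whose records are <IHEYr4> elements inside a wrapper element; the <Log> wrapper, the second record and the helper name users_for_action are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: two records inside an assumed <Log> wrapper element.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<Log>
<IHEYr4><UserAuthenticated><LocalUsername>User1</LocalUsername><Action>Login</Action></UserAuthenticated><Host>192.168.1.15</Host><TimeStamp>2018-01-18T02:31:00</TimeStamp></IHEYr4>
<IHEYr4><UserAuthenticated><LocalUsername>User2</LocalUsername><Action>Logout</Action></UserAuthenticated><Host>192.168.1.16</Host><TimeStamp>2018-01-18T02:35:00</TimeStamp></IHEYr4>
</Log>"""


def users_for_action(source, action, record_tag="IHEYr4"):
    """Return the LocalUsername of every record whose Action equals `action`."""
    users = []
    # "end" events fire once an element (and all of its children) is complete.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            if elem.findtext("UserAuthenticated/Action") == action:
                users.append(elem.findtext("UserAuthenticated/LocalUsername"))
            # Drop the record's children so memory use stays roughly flat.
            elem.clear()
    return users


# A real run would pass a filename or an open binary file instead of BytesIO.
login_users = users_for_action(io.BytesIO(sample), "Login")
print(login_users)
```

If each log line is instead its own small XML document, calling ET.fromstring on one line at a time (as in the demo above) would be the simpler route.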

Related

can we search multiple pattern using etree findall() in xml?

For my case, I have to find few elements in the XML file and update their values using the text attribute. For that, I have to search xml element A, B and C. My project is using xml.etree and python language. Currently I am using:
self.get_root.findall('H/A/T')
self.get_root.findall('H/B/T')
self.get_root.findall('H/C/T')
The sample XML file:
<H><A><T>text-i-have-to-update</H></A></T>
<H><B><T>text-i-have-to-update</H></B></T>
<H><C><T>text-i-have-to-update</H></C></T>
As we can notice, only the middle element in the path is different. Is there a way to optimize the code using something like self.get_root.findall(H|(A,B,C)|T)? Any guidance in the right direction will do! Thanks!
I went through the similar question: XPath to select multiple tags but it didn't work for my case
Update: maybe regular expression inside the findall()?
The XML in your question is malformed (the closing tags are in the wrong order); assuming it's properly formed (like below), try this:
import xml.etree.ElementTree as ET
data = """<root>
<H><A><T>text-i-have-to-update</T></A></H>
<H><B><T>text-i-have-to-update</T></B></H>
<H><C><T>text-i-have-to-update</T></C></H>
</root>"""
doc = ET.fromstring(data)
for item in doc.findall('.//H//T'):
    item.text = "modified text"
print(ET.tostring(doc).decode())
Output:
<root>
<H><A><T>modified text</T></A></H>
<H><B><T>modified text</T></B></H>
<H><C><T>modified text</T></C></H>
</root>
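Note that './/H//T' matches every T below any H, whatever the middle element is. If only A, B and C should be touched, one possible approach is to use the * wildcard that ElementTree's path syntax supports and filter on the tag name in Python. This is a sketch for Python 3; the extra D element and the `wanted` set are made up for the example.

```python
import xml.etree.ElementTree as ET

data = """<root>
<H><A><T>old</T></A></H>
<H><B><T>old</T></B></H>
<H><D><T>old</T></D></H>
</root>"""

doc = ET.fromstring(data)
wanted = {"A", "B", "C"}  # middle elements whose T should be updated

# "H/*" selects every direct child of every H that is a child of the root.
for middle in doc.findall("H/*"):
    if middle.tag in wanted:
        for t in middle.findall("T"):
            t.text = "modified text"

texts = [t.text for t in doc.findall("H/*/T")]
print(texts)
```

ElementTree's path language has no alternation such as H/(A|B|C)/T, which is why the filtering happens in Python here.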

Python add new element by xml ElementTree

XML file
<?xml version="1.0" encoding="utf-8"?>
<Info xmlns="BuildTest">
<RequestDate>5/4/2020 12:27:46 AM</RequestDate>
</Info>
I want to add a new element inside the Info tag.
Here is what I did.
import xml.etree.ElementTree as ET
tree = ET.parse('example.xml')
root = tree.getroot()
ele = ET.Element('element1')
ele.text = 'ele1'
root.append(ele)
tree.write("output.xhtml")
Output
<ns0:Info xmlns:ns0="BuildTest">
<ns0:RequestDate>5/4/2020 12:27:46 AM</ns0:RequestDate>
<element1>ele1</element1></ns0:Info>
Three questions:
The <?xml version="1.0" encoding="utf-8"?> is missing.
The namespace is wrong.
The whitespace of the new element is gone.
I saw many questions related to this topic, most of them are suggesting other packages.
Is there any way it can handle properly?
The <?xml ...?> line is an XML declaration. Like processing instructions, it is markup but not an XML element. A quick search for whether processing instructions are part of an XML document turns up this as the first result:
Processing instructions are markup, but they're not elements.
Since the package you are using is literally called ElementTree, you can reasonably expect its objects to be trees of elements. If I remember correctly, DOM-compliant XML packages can support non-element markup in XML.
For the namespace issue, the answer is on Stack Overflow, at Remove ns0 from XML: you just have to register the namespace you specified in the top element of your document. Namespace URIs are compared exactly, including case, so the string must match the xmlns value in the file. The following worked for me:
ET.register_namespace("", "BuildTest")
As for the whitespace: the new element does not have any whitespace. You can assign to the tail attribute to add a linefeed after an element.
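Putting the three points together, here is a rough Python 3 sketch. It parses the document from a string and writes to an in-memory buffer instead of example.xml, and it assumes the new element should live in the same BuildTest namespace as its parent, which the question does not state.

```python
import io
import xml.etree.ElementTree as ET

NS = "BuildTest"
# Serialize this namespace as the default one (no ns0: prefix).
ET.register_namespace("", NS)

source = """<?xml version="1.0" encoding="utf-8"?>
<Info xmlns="BuildTest">
<RequestDate>5/4/2020 12:27:46 AM</RequestDate>
</Info>"""

tree = ET.ElementTree(ET.fromstring(source))
root = tree.getroot()

# "{uri}name" puts the new element in the same namespace as <Info>.
ele = ET.SubElement(root, "{%s}element1" % NS)
ele.text = "ele1"
ele.tail = "\n"  # linefeed after </element1>

buf = io.BytesIO()
# xml_declaration=True makes write() emit the <?xml ...?> line.
tree.write(buf, encoding="utf-8", xml_declaration=True)
result = buf.getvalue().decode("utf-8")
print(result)
```

With a real file, tree.write("output.xml", encoding="utf-8", xml_declaration=True) takes the same keyword arguments.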

Python XML Parser not Returning XML Elements

I am trying to find a way to use Python to parse data from several .xml files that contain part numbers and descriptions for a system my team is working on. Here's what the files look like:
Note: Actual data sanitized for confidentiality reasons.
<DOCUMENT>
<config>
<lruname>NFS</lruname>
<swpn>123-A-456-7890</swpn>
<swname>00 NFS ABC DEFGHI XYZ JKL</swname>
<swver>Appid: abc-defghi-xyz PN: 123-A-456-7890</swver>
</config>
</DOCUMENT>
I'd like to pull the swpn and swname data from several of these files into .csv format. My initial thought was to parse these values into a dictionary using the built-in xml.etree library, but for some reason it's not finding the elements:
import xml.etree.ElementTree as ET
data = '''
<DOCUMENT>
<config>
<lruname>NFS</lruname>
<swpn>123-A-456-7890</swpn>
<swname>00 NFS ABC DEFGHI XYZ JKL</swname>
<swver>Appid: abc-defghi-xyz PN: 123-A-456-7890</swver>
</config>
</DOCUMENT>
'''
tree = ET.fromstring(data)
PartNo = tree.find('swpn')
Desc = tree.find('swname')
print(PartNo)
The above code returns 'None' for some reason, but I would expect it to return the xml element I'm calling.
I think you're missing the config level in your XML hierarchy, you could do:
part_number = tree.find('config').find('swpn').text
part_desc = tree.find('config').find('swname').text
Alternately you can loop through all the elements if you don't want to have to know the structure and use conditionals to find the elements you care about with tree.iter.
for e in tree.iter():
    if e.tag == 'swpn':
        part_number = e.text
    if e.tag == 'swname':
        part_desc = e.text
The find functionality in ElementTree and lxml's etree searches for direct children only.
You can still use it by specifying the entire branch:
tree.find('config').find('swpn')
tree.find('config/swpn')
If you always want to look for swpn regardless of structure (e.g. you don't know whether it will be a child of config), you might find it easier to use the xpath() method in lxml's etree (xml.etree.ElementTree has no xpath() method, although its find() does accept './/swpn'):
from lxml import etree
tree = etree.fromstring(data)
tree.xpath('//swpn')
In this case, the // means that you are looking for swpn elements anywhere in the tree, no matter how deeply they are nested.
If the xml files are small, and you don't care about performance, you can use minidom which IMHO is more convenient compared to lxml. In this case, your code could be something like this:
from xml.dom.minidom import parseString
xml = parseString(data)
PartNo = xml.getElementsByTagName('swpn')[0]
Desc = xml.getElementsByTagName('swname')[0]
print(PartNo.firstChild.nodeValue)
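Since the stated goal is a .csv file, here is a small Python 3 sketch that collects swpn and swname from each config element and writes them with the csv module. The `documents` list stands in for the contents of the real files, and the output goes to an in-memory buffer; both are assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Stand-in for the text of several .xml files (only one shown here).
documents = [
    "<DOCUMENT><config><lruname>NFS</lruname>"
    "<swpn>123-A-456-7890</swpn>"
    "<swname>00 NFS ABC DEFGHI XYZ JKL</swname></config></DOCUMENT>",
]

rows = []
for text in documents:
    root = ET.fromstring(text)
    # iter("config") finds config elements at any depth below the root.
    for config in root.iter("config"):
        rows.append({
            "swpn": config.findtext("swpn", default=""),
            "swname": config.findtext("swname", default=""),
        })

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["swpn", "swname"])
writer.writeheader()
writer.writerows(rows)
csv_text = out.getvalue()
print(csv_text)
```

For real output, open the target with open("parts.csv", "w", newline="") and pass that file object to DictWriter; the filename is only a placeholder.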

XML parsing in Python using Python 2 or 3

I'm just trying to write a simple program to allow me to parse some of the following XML.
So far in following examples I am not getting the results I'm looking for.
I encounter many of these XML files and I generally want the info after a handful of tags.
What's the best way, using ElementTree, to search for <Id> and grab whatever info is in that tag? I was trying things like
for Reel in root.findall('Reel'):
    id = Reel.findtext('Id')
    print(id)
Is there a way just to look for every instance of <Id> and grab the urn: etc that comes after it? Some code that traverses everything and looks for <what I want> and so on.
This is a very truncated version of what I usually deal with.
This didn't get what I wanted at all. Is there an easy way just to match <what I want> in any XML file and get the contents of that tag, or do I need to know the structure of the XML well enough to know its relation to root/child etc.?
<Reel>
<Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
<AssetList>
<MainPicture>
<Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
<FrameRate>24 1</FrameRate>
<ScreenAspectRatio>2048 858</ScreenAspectRatio>
</MainPicture>
<MainSound>
<Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>340</IntrinsicDuration>
<EntryPoint>0</EntryPoint>
<Duration>340</Duration>
</MainSound>
</AssetList>
</Reel>
@Mata that worked perfectly, but when I tried to use that for different values on another XML file I fell flat on my face. For instance, what about this section of a file? I couldn't post the whole thing, unfortunately. What if I want to grab what comes after KeyId?
<?xml version="1.0" encoding="UTF-8" standalone="no" ?><DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<!-- Generated by Wailua Version 0.3.20 -->
<AuthenticatedPublic Id="ID_AuthenticatedPublic">
<MessageId>urn:uuid:7bc63f4c-c617-4d00-9e51-0c8cd6a4f59e</MessageId>
<MessageType>http://www.digicine.com/PROTO-ASDCP-KDM-20040311#</MessageType>
<AnnotationText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE ~ KDM for Quvis-10010.pem</AnnotationText>
<IssueDate>2007-04-29T04:13:43-00:00</IssueDate>
<Signer>
<dsig:X509IssuerName>dnQualifier=BzC0n/VV/uVrl2PL3uggPJ9va7Q=,CN=.deluxe-admin-c,OU=.mxf-j2c.ca.cinecert.com,O=.ca.cinecert.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>10039</dsig:X509SerialNumber>
</Signer>
<RequiredExtensions>
<Recipient>
<X509IssuerSerial>
<dsig:X509IssuerName>dnQualifier=RUxyQle0qS7qPbcNRFBEgVjw0Og=,CN=SM.QuVIS.com.001,OU=QuVIS Digital Cinema,O=QuVIS.com</dsig:X509IssuerName>
<dsig:X509SerialNumber>363</dsig:X509SerialNumber>
</X509IssuerSerial>
<X509SubjectName>CN=SM MD LE FM.QuVIS_CinemaPlayer-3d_10010,OU=QuVIS,O=QuVIS.com,dnQualifier=3oBfjTfx1me0p1ms7XOX\+eqUUtE=</X509SubjectName>
</Recipient>
<CompositionPlaylistId>urn:uuid:336263da-e4f1-324e-8e0c-ebea00ff79f4</CompositionPlaylistId>
<ContentTitleText>SPIDERMAN-3_FTR_S_EN-XX_US-13_51_4K_PH_20070423_DELUXE</ContentTitleText>
<ContentKeysNotValidBefore>2007-04-30T05:00:00-00:00</ContentKeysNotValidBefore>
<ContentKeysNotValidAfter>2007-04-30T10:00:00-00:00</ContentKeysNotValidAfter>
<KeyIdList>
<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
<KeyId>urn:uuid:5d9b228d-7120-344c-aefc-840cdd32bbfc</KeyId>
<KeyId>urn:uuid:1e32ccb2-ab0b-9d43-b879-1c12840c178b</KeyId>
<KeyId>urn:uuid:44d04416-676a-2e4f-8995-165de8cab78d</KeyId>
<KeyId>urn:uuid:906da0c1-b0cb-4541-b8a9-86476583cdc4</KeyId>
<KeyId>urn:uuid:0fe2d73a-ebe3-9844-b3de-4517c63c4b90</KeyId>
<KeyId>urn:uuid:862fa79a-18c7-9245-a172-486541bef0c0</KeyId>
<KeyId>urn:uuid:aa2f1a88-7a55-894d-bc19-42afca589766</KeyId>
<KeyId>urn:uuid:59d6eeff-cd56-6245-9f13-951554466626</KeyId>
<KeyId>urn:uuid:14a13b1a-76ba-764c-97d0-9900f58af53e</KeyId>
<KeyId>urn:uuid:ccdbe0ae-1c3f-224c-b450-947f43bbd640</KeyId>
<KeyId>urn:uuid:dcd37f10-b042-8e44-bef0-89bda2174842</KeyId>
<KeyId>urn:uuid:9dd7103e-7e5a-a840-a15f-f7d7fe699203</KeyId>
</KeyIdList>
</RequiredExtensions>
<NonCriticalExtensions/>
</AuthenticatedPublic>
<AuthenticatedPrivate Id="ID_AuthenticatedPrivate"><enc:EncryptedKey xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
<enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#rsa-oaep-mgf1p">
<ds:DigestMethod xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
</enc:EncryptionMethod>
The expression Reel.findtext('Id') only matches direct children of Reel. If you want to find all Id tags in your xml document, you can just use:
ids = [id.text for id in Reel.findall(".//Id")]
This would give you a list with the text of every Id tag that is a descendant of Reel, at any depth.
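To make the difference concrete, here is a short Python 3 sketch run against a trimmed copy of the Reel sample from the question (the elements other than Id are left out for brevity):

```python
import xml.etree.ElementTree as ET

reel_xml = """<Reel>
<Id>urn:uuid:632437bc-73f9-49ca-b687-fdb3f98f430c</Id>
<AssetList>
<MainPicture><Id>urn:uuid:46afe8a3-50be-4986-b9c8-34f4ba69572f</Id></MainPicture>
<MainSound><Id>urn:uuid:1fce0915-f8c7-48a7-b023-36e204a66ed1</Id></MainSound>
</AssetList>
</Reel>"""

reel = ET.fromstring(reel_xml)

# findtext("Id") only looks at direct children, so it returns the Reel's own Id.
direct_id = reel.findtext("Id")

# ".//Id" descends to any depth and returns every Id in document order.
all_ids = [elem.text for elem in reel.findall(".//Id")]

print(direct_id)
print(all_ids)
```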
edit:
Your updated example uses namespaces, in this case KeyId is in the default namespace (http://www.digicine.com/PROTO-ASDCP-KDM-20040311#), so to search for it you need to include it in your search:
from xml.etree import ElementTree
doc = ElementTree.parse('test.xml')
nsmap = {'ns': 'http://www.digicine.com/PROTO-ASDCP-KDM-20040311#'}
ids = [id.text for id in doc.findall(".//ns:KeyId", namespaces=nsmap)]
print(ids)
...
The XPath subset that ElementTree supports is rather limited. If you want more complete support, use lxml instead; its XPath support is far more complete.
For example, using xpath to search for all KeyId tags (ignoring namespaces) and returning their text content directly:
from lxml import etree
doc = etree.parse('test.xml')
ids = doc.xpath(".//*[local-name()='KeyId']/text()")
print(ids)
...
It sounds like XPath might be right up your alley - it will let you query your XML document for exactly what you're looking for, as long as you know the structure.
Here's what I needed to do. This works for finding whatever I need.
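for node in tree.iter():  # getiterator() was removed in Python 3.9; iter() replaces it
    if 'KeyId' in node.tag:
        mylist = node.tag  # use node.text instead to get the value rather than the tag name
        print(mylist)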
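The substring test above also matches KeyIdList, because 'KeyId' occurs inside that tag name. A slightly stricter variant, sketched here for Python 3, strips the {namespace} prefix and compares the local name exactly. The helper names local_name and texts_by_local_name are invented for this example, and the sample is a cut-down version of the KDM document from the question.

```python
import xml.etree.ElementTree as ET

kdm = """<DCinemaSecurityMessage xmlns="http://www.digicine.com/PROTO-ASDCP-KDM-20040311#">
<KeyIdList>
<KeyId>urn:uuid:9851b0f6-4790-0d4c-a69d-ea8abdedd03d</KeyId>
<KeyId>urn:uuid:8317e8f3-1597-494d-9ed8-08a751ff8615</KeyId>
</KeyIdList>
</DCinemaSecurityMessage>"""


def local_name(tag):
    """Turn '{namespace}name' into 'name'; tags without a namespace pass through."""
    return tag.rsplit("}", 1)[-1]


def texts_by_local_name(root, name):
    """Collect the text of every element whose local name equals `name`."""
    return [
        elem.text
        for elem in root.iter()
        if isinstance(elem.tag, str) and local_name(elem.tag) == name
    ]


root = ET.fromstring(kdm)
key_ids = texts_by_local_name(root, "KeyId")
print(key_ids)
```

This works without knowing the namespace URI in advance, at the cost of ignoring namespaces entirely, so elements with the same local name in different namespaces are not told apart.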

Python: Read and write namespaced XML using ElementTree

This XML file is named example.xml:
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>14.0.0</modelVersion>
<groupId>.com.foobar.flubber</groupId>
<artifactId>uberportalconf</artifactId>
<version>13-SNAPSHOT</version>
<packaging>pom</packaging>
<name>Environment for UberPortalConf</name>
<description>This is the description</description>
<properties>
<birduberportal.version>11</birduberportal.version>
<promotiondevice.version>9</promotiondevice.version>
<foobarportal.version>6</foobarportal.version>
<eventuberdevice.version>2</eventuberdevice.version>
</properties>
<!-- A lot more here, but as it is irrelevant for the problem I have removed it -->
</project>
If I load the example.xml file above using ElementTree and print the root node:
>>> from xml.etree import ElementTree
>>> tree = ElementTree.parse('example.xml')
>>> print tree.getroot()
<Element '{http://maven.apache.org/POM/4.0.0}project' at 0x26ee0f0>
I see that Element also contains the namespace http://maven.apache.org/POM/4.0.0.
How do I:
1. Get the foobarportal.version text, increase it by one and write the XML file back while keeping the namespace the document had when loaded and also not change the overall XML layout.
2. Get it to load using any namespace, not just http://maven.apache.org/POM/4.0.0. I still don't want to strip the namespace, as I want the XML to stay the same except for changing foobarportal.version as in 1 above.
The current way is not aware of XML but fulfills 1 and 2 above:
Grep for <foobarportal.version>(.*)</foobarportal.version>
Take the contents of the match group and increase it by one
Write it back.
It would be nice to have an XML aware solution, as it would be more robust. The XML namespace handling of ElementTree is making it more complicated.
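If your question is simply: "how do I search by a namespaced element name", then the answer is that ElementTree, like lxml, understands the {namespace}tag syntax. Note that find() searches below the element it is called on, so from the root you search for a descendant rather than for project itself:
tree.getroot().find('.//{http://maven.apache.org/POM/4.0.0}foobarportal.version')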
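For the two numbered requirements in the question, one possible approach is to read the namespace URI off the root tag instead of hard-coding it, register it as the default namespace, and then bump the value. The sketch below is Python 3 and works on a shortened copy of example.xml held in a string; it is an illustration under those assumptions rather than a complete solution.

```python
import xml.etree.ElementTree as ET

pom = """<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <properties>
    <foobarportal.version>6</foobarportal.version>
  </properties>
</project>"""

root = ET.fromstring(pom)

# root.tag looks like "{uri}project"; take the uri whatever it happens to be.
ns_uri = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""
if ns_uri:
    # Write the namespace back as xmlns="..." instead of an ns0: prefix.
    ET.register_namespace("", ns_uri)

prefix = "{%s}" % ns_uri if ns_uri else ""
node = root.find(".//%sfoobarportal.version" % prefix)
node.text = str(int(node.text) + 1)

updated = ET.tostring(root, encoding="unicode")
print(updated)
```

By default ElementTree does not keep XML comments, and details such as attribute quoting or the exact XML declaration may come out differently, so the "same layout" requirement is only approximately met this way.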
