Use Python to edit XML header - python

I've written a Python script to create some XML, but I didn't find a way to edit the heading within Python.
Basically, instead of having this as my heading:
<?xml version="1.0" ?>
I need to have this:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
I looked into this as best I could, but found no way to add standalone status from within Python. So I figured that I'd have to go over the file after it had been created and then replace the text. I read in several places that I should stay far away from using readlines() because it could ruin the XML formatting.
Currently the code I have to do this - which I got from another Stackoverflow post about editing XML with Python - is:
doc = parse('file.xml')
elem = doc.find('xml version="1.0"')
elem.text = 'xml version="1.0" encoding="UTF-8" standalone="no"'
That provides me with a KeyError. I've tried to debug it to no avail, which leads me to believe that perhaps the XML heading wasn't meant to be edited in this way. Or my code is just wrong.
If anyone is curious (or miraculously knows how to insert standalone status into Python code), here is my code to write the xml file:
with open('file.xml', 'w') as f:
f.write(doc.toprettyxml(indent=' '))
Some people don't like "toprettyxml", but with my relatively basic level, it seemed like the best bet.
Anyway, if anyone can provide some advice or insight, I would be very grateful.

The xml.etree API does not give you any options to write out a standalone attribute in the XML declaration.
You may want to use the lxml library instead; it uses the same ElementTree API, but offers more control over the output. tostring() supports a standalone flag:
from lxml import etree
etree.tostring(doc, pretty_print=True, standalone=False)
or use .write(), which support the same options:
doc.write(outputfile, pretty_print=True, standalone=False)

Related

How to write XML with "usual" self-closing/header tags?

This is my code:
from xml.dom import minidom as md
doc = md.parse('file.props')
# operations with doc
xml_file = open('file.props', "w")
doc.writexml(xml_file, encoding="utf-8")
xml_file.close()
I parse an XML, I do some operations, than I open and write on it. But for example, if in my file got:
<MY_TAG />
^
it rewrites as:
<MY_TAG/>
^
I know this can seem irrelevant, but my file is constantly monitored by version control GIT, which say that line is "different" on every write.
The same with the header:
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
it becomes:
<?xml version="1.0" encoding="utf-8"?><Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
Which is pretty annoying. Any clues?
Retaining quirks of formatting in XML through parsing and serialization is pretty well impossible. If you need to do text-level comparisons, the only way to do it is to canonicalize the formats that you are comparing (there are various XML canonicalization tools out there).
In principle you can configure git to use an XML-aware diff tool for comparisons, but please don't ask me for details, it's not something I've ever done myself. I've always just lived with the fact that it works badly.

How to (push) parse XML files in Python?

I've already seen this question, but it's from the 2009.
What's a simple modern way to handle XML files in Python 3?
I.e., from this TLD (adapted from here):
<?xml version="1.0" encoding="UTF-8" ?>
<taglib>
<tlib-version>1.0</tlib-version>
<short-name>bar-baz</short-name>
<tag>
<name>present</name>
<tag-class>condpkg.IfSimpleTag</tag-class>
<body-content>scriptless</body-content>
<attribute>
<name>test</name>
<required>true</required>
<rtexprvalue>true</rtexprvalue>
</attribute>
</tag>
</taglib>
I want to parse TLD files (Java Server Pages Tag Library Descriptors), to obtain some sort of structure in Python (I have still to decide about that part).
Hence, I need a push parser. But I won't do much more with it, so I'd rather prefer a simple API (I'm new to Python).
xml.etree.ElementTree is still there, in the standard library:
import xml.etree.ElementTree as ET
data = """your xml here"""
tree = ET.fromstring(data)
print(tree.find('tag/name').text) # prints "present"
If you look outside of the standard library, there is a very popular and fast lxml module that follows the ElementTree interface and supports Python3:
from lxml import etree as ET
data = """your xml here"""
tree = ET.fromstring(data)
print(tree.find('tag/name').text) # prints "present"
Besides, there is lxml.objectify that allows you to deal with XML structure like with a Python object.

How can i deal with xml namespace in python?

I'm not going to parse a xml file but i want to move it in python. (i.e etree)
I know about the basic of etree however all i want to know is about xml namespace.
I've got 3 namespaces in code. Here is the code that i want to transfer.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?><ns1:g2sMessage xmlns:ns1="http://www.gamingstandards.com/g2s/schemas/v1.0.3"><ns1:g2sBody ns1:dateTimeSent="2014-08-14T15:30:48.692+09:00" ns1:egmId="TestAppEGMID" ns1:hostId="1"><ns1:communications ns1:deviceId="0" ns1:sessionId="1001" ns1:sessionType="G2S_request" ns1:timeToLive="100" ns1:commandId="1001" ns1:dateTime="2014-08-14T06:30:48.696Z"><ns1:keepAlive/></ns1:communications></ns1:g2sBody></ns1:g2sMessage>TestAppEGMID1
Does anyone have an idea?

Python ElementTree - print out namespace definitions?

I'm using Python's elementtree to parse some XML configuration files.
At the top of the file, I have a root element like this:
<?xml version="1.0" encoding="utf-8"?>
<sgx:FooConfig
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:foo="http://ns.au.firm.com/foo.xsd"
xmlns:bar="http://ns.au.firm.com/bar.xsd"
>
The problem is, the bar namespace can be set to one of two different XSDs, depending on the version of the configuration file.
I'm looking for a way to print out the namespace mapping using ElementTree, so I can check which of the two XSDs is being used - then I can get my code to handle the correct case.
Is there a way to print out all the namespace definitions out using Python?
Cheers,
Victor
What you have is not valid xml (undefined prefixes) and I think you can't do this with xml.etree but you should be able to do it using lxml.
import lxml.etree as et
tree = et.XML(yourxml)
print tree.nsmap

Parsing log4j in Python

I have a log file (log4j xml-ish format) that I am trying to pull info out of and use in my Python module. Could I treat this file as if it were XML? My gut is telling me no... If not, what is the best way to parse the data? Below is a section of the log file. The file does not include your standard doctype or version headers which is why I said "xml-ish."
<log4j:event
logger="com.hp.cp.elk.impl.subscriptions.AsyncSimpleSubscriptionManager"
timestamp="1352320517430" level="DEBUG" thread="Thread-77">
<log4j:message><![CDATA[Broadcasting signals to subscribers...]]></log4j:message>
</log4j:event>
<log4j:event logger="com.hp.cp.jdf.idp.queue.IDPJobProgressMonitor"
timestamp="1352320517430" level="DEBUG" thread="IDPJobProgressMonitorThread">
<log4j:message><![CDATA[[JDFQueueEntry[ --> JDFAutoQueueEntry[ --> JDFElement[
--> <?xml version="1.0" encoding="UTF-8"?><QueueEntry
xmlns="http://www.CIP4.org/JDFSchema_1_1"
DescriptiveName="H44E61-6.pdf" DeviceID="HPPRO1-SM1"
EndTime="2012-11-07T10:58:18-08:00" JobID="Default" Priority="50"
QueueEntryID="d5fbbe98a1194e0da573b51a0c8040fb" Status="Completed"
SubmissionTime="2012-11-06T16:35:06-08:00"> <Comment AgentName="CIP4 JDF Writer
Java" AgentVersion="1.4a BLD 63" ID="c_121106_163506894_000804"
Name="JobSpec">WBG_4C_Flat_21up_BusCards_Duplex</Comment>
</QueueEntry>
] ] ]] queue entries changed.]]></log4j:message>
</log4j:event>
<log4j:event logger="com.hp.cp.jdf.idp.queue.IDPJobProgressMonitor"
timestamp="1352320517430" level="DEBUG" thread="IDPJobProgressMonitorThread">
<log4j:message><![CDATA[no active queue entries changed.]]></log4j:message>
</log4j:event>
Sorry for the messy code, I just wanted to make you all can get an idea of the formatting. Anyway, I'm currently just trying to pull the value from QueueEntryID="d5fbbe98a1194e0da573b51a0c8040fb" Any suggestions? Thank you!
I would imagine that you could use standard XML tools like DOM or SAX to parse this. Otherwise, have fun with re or htmllib.

Categories