How to add xml header to dom object - python

I'm using Python's xml.dom.minidom but I think the question is valid for any DOM parser.
My original file has a line like this at the beginning:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
This doesn't seem to be part of the dom, so when I do something like dom.toxml() the resulting string have not line at the beginning.
How can I add it?
example outpupt:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<Root xmlns:aid="http://xxxxxxxxxxxxxxxxxx">
<Section>BANDSAW BLADES</Section>
</Root>
hope to be clear.

This doesn't seem to be part of the dom
The XML Declaration doesn't get a node of its own, no, but the properties declared in it are visible on the Document object:
>>> doc= minidom.parseString('<?xml version="1.0" encoding="utf-8" standalone="yes"?><a/>')
>>> doc.encoding
'utf-8'
>>> doc.standalone
True
Serialising the document should include the standalone="yes" part of the declaration, but toxml() doesn't. You could consider this a bug, perhaps, but really the toxml() method doesn't make any promises to serialise the XML declaration in an appropriate way. (eg you don't get an encoding unless you specifically ask for it either.)
You could take charge of writing the document yourself:
xml= []
xml.append('<?xml version="1.0" encoding="utf-8" standalone="yes"?>')
for child in doc.childNodes:
xml.append(child.toxml())
but do you really need the XML Declaration here? You are using the default version and encoding, and since you have no DOCTYPE there can be no externally-defined entities, so the document is already standalone by nature. As per the XML standard: “if there are no external markup declarations, the standalone document declaration has no meaning”. It seems to me you could safely omit it completely.

Related

Python add new element by xml ElementTree

XML file
<?xml version="1.0" encoding="utf-8"?>
<Info xmlns="BuildTest">
<RequestDate>5/4/2020 12:27:46 AM</RequestDate>
</Info>
I want to add a new element inside the Info tag.
Here is what I did.
import xml.etree.ElementTree as ET
tree = ET.parse('example.xml')
root = tree.getroot()
ele = ET.Element('element1')
ele.text = 'ele1'
root.append(ele)
tree.write("output.xhtml")
Output
<ns0:Info xmlns:ns0="BuildTest">
<ns0:RequestDate>5/4/2020 12:27:46 AM</ns0:RequestDate>
<element1>ele1</element1></ns0:Info>
Three questions:
The <?xml version="1.0" encoding="utf-8"?> is missing.
The namespace is wrong.
The whitespace of the new element is gone.
I saw many questions related to this topic, most of them are suggesting other packages.
Is there any way it can handle properly?
The processing instructions are not considered XML elements. Just Google are processing instructions part of an XML, and the first result states:
Processing instructions are markup, but they're not elements.
Since the package you are using is literally called ElementTree, you can reasonably expect its objects to be a trees of elements. If I remember correctly, DOM compliant XML packages can support non-element markup in XML.
For the namespace issue, the answer is in stack overflow, at Remove ns0 from XML - you just have to register the namespace you specified in the top element of your document. The following worked for me:
ET.register_namespace("", "Buildtest")
As for the whitespace - the new element does not have any whitespace. You can assign to the tail member to add a linefeed after an element.

Python XML iterparse() namespacing

According to this post, I successfully can parse my XML file, and reading it's content. However, if I add namespace to it, the whole thing goes wrong.
Let's consider the following XML:
<root xmlns="MyNamespace">
<A1>
<B1></B1>
<C>1<D1></D1></C>
<E1></E1>
</A1>
<A2>
<B2></B2>
<C>2<D></D></C>
<E2></E2>
</A2>
</root>
My iterparse looks like this:
context = ET.iterparse('../in/process/teszt.xml', events=('end', ), tag='B1')
I found several examples, but to be honest I don't really understand them, and have no idead how to solve this problem.
In case of XML with default namespace, you need to use the namespace URI along with the element's local name in tag :
context = ET.iterparse('../in/process/teszt.xml', events=('end', ), tag='{MyNamespace}B1')

Retrieve content of element with unknown namespace using python

I am attempting to parse a maven project definition using python to extract a version.
The project definition looks like:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>...</groupId>
<artifactId>...</artifactId>
<version>1.6.0-SNAPSHOT</version>
...
</project>
I can extract the version using:
root = ET.fromstring(xml)
version = root.find('./p:version', { 'p': 'http://maven.apache.org/POM/4.0.0' })
print(version.text)
prints: 1.6.0-SNAPSHOT
However, the namespace used may change, and I don't want to depend on this. Is there a way to extract the namespace to use in my subsequent xpath expression?
I tried the following, to see if xmlns was itself exposed, but no luck:
root = ET.fromstring(xml)
for k in root.attrib:
print('%s => %s' % (k, root.attrib[k]))
prints: {http://www.w3.org/2001/XMLSchema-instance}schemaLocation => http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd
However, the namespace used may change, and I don't want to depend on this.
Are you saying that the namespace uri might change, or that the prefix might? If it's just the prefix, then that's not an issue, because what matters is that the prefixes in your XPath match the prefixes you supply to the XPath evaluator. And in either case, auto-detecting the namespaces is probably a bad call. Suppose someone decides to start generating that XML like this:
<proj:project xmlns:proj="http://maven.apache.org/POM/4.0.0"
xmlns:other="http://maven.apache.org/POM/5.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/maven-v4_0_0.xsd">
which is still perfectly representing the XML in the same namespace as your example, but you have no idea that the proj prefix is the namespace prefix you're looking for.
I think it's unlikely that Apache would suddenly change the namespace for one of their official XML formats, but if you are genuinely worried about it, there should always be the option of using local-name() to namespace-agnostically find a node you're looking for:
version = root.find('./*[local-name() = "version"]')
Also, I'm not familiar with the elementTree library, but you could try this to try to get information about the XML document's namespaces, just to see if you can:
namespaces = root.findall('//namespace::*')
Unfortunately, ElementTree namespace support is rather patchy.
You'll need to use an internal method from the xml.etree.ElementTree module to get a namespace map out:
_, namespaces = ET._namespaces(root, 'utf8')
namespaces is now a dict with URIs as keys, and prefixes as values.
You could switch to lxml instead. That library implements the same ElementTree API, but has augmented that API considerably.
For example, each node includes a .nsmap attribute which maps prefixes to URIs, including the default namespace under the key None.

Python: Read and write namespaced XML using ElementTree

This XML file is named example.xml:
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>14.0.0</modelVersion>
<groupId>.com.foobar.flubber</groupId>
<artifactId>uberportalconf</artifactId>
<version>13-SNAPSHOT</version>
<packaging>pom</packaging>
<name>Environment for UberPortalConf</name>
<description>This is the description</description>
<properties>
<birduberportal.version>11</birduberportal.version>
<promotiondevice.version>9</promotiondevice.version>
<foobarportal.version>6</foobarportal.version>
<eventuberdevice.version>2</eventuberdevice.version>
</properties>
<!-- A lot more here, but as it is irrelevant for the problem I have removed it -->
</project>
If I load the example.xml file above using ElementTree and print the root node:
>>> from xml.etree import ElementTree
>>> tree = ElementTree.parse('example.xml')
>>> print tree.getroot()
<Element '{http://maven.apache.org/POM/4.0.0}project' at 0x26ee0f0>
I see that Element also contains the namespace http://maven.apache.org/POM/4.0.0.
How do I:
Get the foobarportal.version text, increase it by one and write the XML file back while keeping the namespace the document had when loaded and also not change the overall XML layout.
Get it to load using any namespace, not just http://maven.apache.org/POM/4.0.0. I still don´t want to strip the namespace, as I want the XML to stay the same except for changing foobarportal.version as in 1 above.
The current way is not aware of XML but fulfills 1 and 2 above:
Grep for <foobarportal.version>(.*)</foobarportal.version>
Take the contents of the match group and i increase it by one
Write it back.
It would be nice to have an XML aware solution, as it would be more robust. The XML namespace handling of ElementTree is making it more complicated.
If your question is simply: "how do I search by a namespaced element name", then the answer is that lxml understands {namespace} syntax, so you can do:
tree.getroot().find('{http://maven.apache.org/POM/4.0.0}project')

Crunching xml with python

I need to remove white spaces between xml tags, e.g. if the original xml looks like:
<node1>
<node2>
<node3>foo</node3>
</node2>
</node1>
I'd like the end-result to be crunched down to single line:
<node1><node2><node3>foo</node3></node2></node1>
Please note that I will not have control over the xml structure, so the solution should be generic enough to be able to handle any valid xml. Also the xml might contain CDATA blocks, which I'd need to exclude from this crunching and leave them as-is.
I have couple of ideas so far: (1) parse the xml as text and look for start and end of tags < and > (2) another approach is to load the xml document and go node-by-node and print out a new document by concatenating the tags.
I think either method would work, but I'd rather not reinvent the wheel here, so may be there is a python library that already does something like this? If not, then any issues/pitfalls to be aware of when rolling out my own cruncher? Any recommendations?
EDIT
Thank you all for answers/suggestions, both Triptych's and Van Gale's solutions work for me and do exactly what I want. Wish I could accept both answers.
This is pretty easily handled with lxml (note: this particular feature isn't in ElementTree):
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
foo = """<node1>
<node2>
<node3>foo </node3>
</node2>
</node1>"""
bar = etree.XML(foo, parser)
print etree.tostring(bar,pretty_print=False,with_tail=True)
Results in:
<node1><node2><node3>foo </node3></node2></node1>
Edit: The answer by Triptych reminded me about the CDATA requirements, so the line creating the parser object should actually look like this:
parser = etree.XMLParser(remove_blank_text=True, strip_cdata=False)
I'd use XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="*">
<xsl:copy>
<xsl:copy-of select="#*" />
<xsl:apply-templates />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
That should do the trick.
In python you could use lxml (direct link to sample on homepage) to transform it.
For some tests, use xsltproc, sample:
xsltproc test.xsl test.xml
where test.xsl is the file above and test.xml your XML file.
Pretty straightforward with BeautifulSoup.
This solution assumes it is ok to strip whitespace from the tail ends of character data.
Example: <foo> bar </foo> becomes <foo>bar</foo>
It will correctly ignore comments and CDATA.
import BeautifulSoup
s = """
<node1>
<node2>
<node3>foo</node3>
</node2>
<node3>
<!-- I'm a comment! Leave me be! -->
</node3>
<node4>
<![CDATA[
I'm CDATA! Changing me would be bad!
]]>
</node4>
</node1>
"""
soup = BeautifulSoup.BeautifulStoneSoup(s)
for t in soup.findAll(text=True):
if type(t) is BeautifulSoup.NavigableString: # Ignores comments and CDATA
t.replaceWith(t.strip())
print soup
Not a solution really but since you asked for recommendations: I'd advise against doing your own parsing (unless you want to learn how to write a complex parser) because, as you say, not all spaces should be removed. There are not only CDATA blocks but also elements with the "xml:space=preserve" attribute, which correspond to things like <pre> in XHTML (where the enclosed whitespaces actually have meaning), and writing a parser that is able to recognize those elements and leave the whitespace alone would be possible but unpleasant.
I would go with the parsing method, i.e. load the document and go node-by-node printing them out. That way you can easily identify which nodes you can strip the spaces out of and which you can't. There are some modules in the Python standard library, none of which I have ever used ;-) that could be useful to you... try xml.dom, or I'm not sure if you could do this with xml.parsers.expat.

Categories