Python: lxml not pretty-printing newly added nodes

I'm using a Python script to add nodes (or copy existing nodes) in an XML file. The script uses the lxml library. Here is the existing snippet:
<entitlements>
    <bpuiEnabledForSubusers>true</bpuiEnabledForSubusers>
    <appCodesAllowedForSubusers>My Accounts,Bill Pay</appCodesAllowedForSubusers>
    <enabled>true</enabled>
    <monitored>true</monitored>
</entitlements>
So I use lxml to copy a node in the entitlements node. Then, when I
return etree.tostring(self.root,encoding='unicode', pretty_print=True)
I get the following xml:
<entitlements>
    <bpuiEnabledForSubusers>true</bpuiEnabledForSubusers>
    <appCodesAllowedForSubusers>My Accounts,Bill Pay</appCodesAllowedForSubusers>
    <enabled>true</enabled>
    <monitored>true</monitored>
<appCodesAllowedForSubusersCopy>My Accounts,Bill Pay</appCodesAllowedForSubusersCopy></entitlements>
So the node is properly copied and added to the end of the child nodes, but in the XML it is not indented to the level of its siblings, and the parent's closing tag is on the same line, even though I used the pretty_print option. Although the resulting XML is technically correct, it does not "look good" according to our existing standards.
Any idea why this is happening?
Thanks...

pretty_print=True only has a useful effect when the nodes in your tree don't already carry trailing whitespace (tail text). So you want to look not just at how you emit the nodes, but at how you parse them in the first place.
Use the remove_blank_text=True parser option:
parser = etree.XMLParser(remove_blank_text=True)
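A minimal sketch of that approach follows. The input is a shortened stand-in for the document in the question, and enabledCopy is an illustrative element name, not one from the original script. Parsing with remove_blank_text=True drops the whitespace-only text between elements, which leaves pretty_print free to re-indent the whole tree, including nodes added afterwards:

```python
from lxml import etree

# Shortened stand-in for the document in the question.
SOURCE = b"""<entitlements>
    <enabled>true</enabled>
    <monitored>true</monitored>
</entitlements>"""

# Drop whitespace-only text nodes while parsing.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(SOURCE, parser)

# Add a new node at the end, as in the question (illustrative name).
new_node = etree.SubElement(root, "enabledCopy")
new_node.text = root.findtext("enabled")

pretty = etree.tostring(root, encoding="unicode", pretty_print=True)
print(pretty)
```

With the blank text removed at parse time, the new node should come out on its own line, indented like its siblings, with the closing parent tag on the following line.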

Related

How to stop Python ElementTree from doing <element /> instead of <element></element>? [duplicate]

When creating an XML file with Python's etree, if we write to the file an empty tag using SubElement, I get:
<MyTag />
Unfortunately, our XML parser library used in Fortran doesn't handle this even though it's a correct tag. It needs to see:
<MyTag></MyTag>
Is there a way to change the formatting rules or something in etree to make this work?
As of Python 3.4, you can use the short_empty_elements argument for both the tostring() function and the ElementTree.write() method:
>>> from xml.etree import ElementTree as ET
>>> ET.tostring(ET.fromstring('<mytag/>'), short_empty_elements=False)
b'<mytag></mytag>'
In older Python versions, (2.7 through to 3.3), as a work-around you can use the html method to write out the document:
>>> from xml.etree import ElementTree as ET
>>> ET.tostring(ET.fromstring('<mytag/>'), method='html')
'<mytag></mytag>'
Both the ElementTree.write() method and the tostring() function support the method keyword argument.
On even earlier versions of Python (2.6 and before) you can install the external ElementTree library; version 1.3 supports that keyword.
Yes, it sounds a little weird, but the html output mostly outputs empty elements as a start and end tag. Some elements still end up as empty tag elements; specifically <link/>, <input/>, <br/> and such. Still, it's that or upgrade your Fortran XML parser to actually parse standards-compliant XML!
This was directly solved in Python 3.4. From then, the write method of xml.etree.ElementTree.ElementTree has the short_empty_elements parameter which:
controls the formatting of elements that contain no content. If True (the default), they are emitted as a single self-closed tag, otherwise they are emitted as a pair of start/end tags.
More details in the xml.etree documentation.
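For the write() variant, a small sketch using only the standard library (Python 3.4 or later). The in-memory buffer is just a stand-in for a real output file, and the element names are illustrative:

```python
import io
import xml.etree.ElementTree as ET

# Build a tiny tree containing one empty element.
root = ET.Element("root")
ET.SubElement(root, "MyTag")
tree = ET.ElementTree(root)

# write() accepts a file name or a file-like object; BytesIO stands in for a file.
buffer = io.BytesIO()
tree.write(buffer, short_empty_elements=False)
expanded = buffer.getvalue()

print(expanded)  # b'<root><MyTag></MyTag></root>'
```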
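Adding an empty text is another option. This works with lxml; the standard library's serializer tests node.text for truthiness, so an empty string still produces a self-closing tag there: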
etree.SubElement(parent, 'child_tag_name').text=''
But note that this will change not only the representation but also the structure of the document: i.e. child_el.text will be '' instead of None.
Oh, and like Martijn said, try to use better libraries.
If you have sed available, you could pipe the output of your python script to
sed -e "s/<\([^>]*\) \/>/<\1><\/\1>/g"
Which will find any occurrence of <Tag /> and replace it with <Tag></Tag>
Paraphrasing the code, the version of ElementTree.py I use contains the following in a _write method:
write('<' + tagname)
...
if node.text or len(node): # this line is literal
    write('>')
    ...
    write('</%s>' % tagname)
else:
    write(' />')
To steer the program counter I created the following:
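class AlwaysTrueString(str):
    # Python 2 consults __nonzero__ for truth tests.
    def __nonzero__(self): return True
    # Python 3 consults __bool__ instead.
    __bool__ = __nonzero__
true_empty_string = AlwaysTrueString()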
Then I set node.text = true_empty_string on those ElementTree nodes where I want an open-close tag rather than a self-closing one.
By "steering the program counter" I mean constructing a set of inputs—in this case an object with a somewhat curious truth test—to a library method such that the invocation of the library method traverses its control flow graph the way I want it to. This is ridiculously brittle: in a new version of the library, my hack might break—and you should probably treat "might" as "almost guaranteed". In general, don't break abstraction barriers. It just worked for me here.
If you have Python >=3.4, use the short_empty_elements=False option as has been shown in other answers already, but:
If you have the XML in string form already and can't touch the code where it's generated..
If you're in a situation where you are stuck with Python <3.4..
If you're using a different XML library that insists on self-closing tags..
Then this works:
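import re

xml = "<foo/><bar/>"
# Capture the tag name and any attributes separately, so that the match cannot
# span several tags and the closing tag repeats only the name.
xml = re.sub(r'<([^\s<>/]+)([^<>]*?)\s*/>', r'<\1\2></\1>', xml)
print(xml)
# output will be
# <foo></foo><bar></bar>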

Parse xbrl file in python

I am working on a xml parser.
The goal is to parse a number of different xml files where prefixes and tags remain consistent but namespaces change.
I am hence trying either:
to parse the xml just by <prefix:tags> without resolving (replacing) the prefix with the namespace. The prefixes remain unchanged from document to document.
to load automatically the namespaces so that the identifier (<prefix:tag>) could be replaced with the proper namespace.
just parse the xml by tag
I have tried with xml.etree.ElementTree.
I also had a look at lxml
I did not find any configuration option of the XMLParser in lxml that could help me out although here I could read an answer where the author suggests that lxml should be able to collect namespaces for me automatically.
Interestingly, parsed_file = etree.XML(file) fails with the error:
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
One example of the files I would like to parse is here
Do not care about ns prefixes, care about complete namespaces
Sometimes people care about those short prefixes and forget that they are of secondary importance. They are only short references to a fully qualified namespace. E.g.
xmlns:trw="http://www.trw.com/20131231"
in XML means that from now on "trw:" stands for the fully qualified namespace "http://www.trw.com/20131231". Note that this prefix can be redefined to any other namespace in any following element and may then get a completely different meaning.
On the other hand, when you care about the real meaning, which here is the fully qualified namespace, you may think of "trw:row" as "{http://www.trw.com/20131231}row". This translated meaning is reliable and will not change when prefixes change.
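To illustrate, here is a small sketch with the standard library's ElementTree. The two documents are made up for the example: they use different prefixes for the same namespace, and the "ns" key in the lookup map is an arbitrary local alias. Both parse to the same fully qualified tag names:

```python
import xml.etree.ElementTree as ET

TRW = "http://www.trw.com/20131231"

# Two made-up documents: same namespace, different prefixes.
DOC_A = '<trw:table xmlns:trw="%s"><trw:row>1</trw:row></trw:table>' % TRW
DOC_B = '<x:table xmlns:x="%s"><x:row>1</x:row></x:table>' % TRW

tags = []
for doc in (DOC_A, DOC_B):
    root = ET.fromstring(doc)
    # The prefix used in the query ("ns") is local to this lookup map.
    row = root.find("ns:row", {"ns": TRW})
    tags.append((root.tag, row.tag))

print(tags[0])
print(tags[0] == tags[1])
```

The parsed tags carry the namespace in curly braces, so code written against them keeps working whichever prefix a document happens to use.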
Parsing referred xml
The link to http://edgar.sec.gov/Archives/edgar/data/1267097/000104746914000925/trw-20131231.xml leads to an XML document that xmlstarlet validates and that lxml is able to parse.
The error message you show refers to the very first character of the stream, so chances are that you either hit a BOM at the start of your file, or you are trying to read XML that is gzipped and must be decompressed first.
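A rough way to check both possibilities is to look at the first bytes of the data before parsing. The sketch below uses only the standard library; load_xml_bytes is a hypothetical helper name, and it only covers a UTF-8 BOM and gzip compression:

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"      # first two bytes of any gzip stream
UTF8_BOM = b"\xef\xbb\xbf"    # UTF-8 byte order mark


def load_xml_bytes(raw):
    """Parse XML from raw bytes, undoing gzip and a UTF-8 BOM if present."""
    if raw.startswith(GZIP_MAGIC):
        raw = gzip.decompress(raw)
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
    return ET.fromstring(raw)


# Example: a gzipped document that also starts with a BOM.
plain = b"<root><row>1</row></root>"
root = load_xml_bytes(gzip.compress(UTF8_BOM + plain))
print(root.tag, root.findtext("row"))
```

Reading the file in binary mode (open(path, "rb").read()) and passing the bytes to such a helper keeps the check independent of any text decoding.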
lxml and namespaces
lxml works well with namespaces. It allows you to use XPath expressions that use namespaces. Controlling the namespace prefixes on output is a bit more complex, as it depends on the xmlns attributes that are part of the serialized document. If you want to modify the prefixes, you must somehow organize these xmlns attributes, often by moving all of them to the root element. At the same time, lxml keeps track of the fully qualified namespace of each element, so at the moment of serialization it will respect this full name as well as the currently valid prefix for this namespace.
Handling these xmlns attributes takes a bit more code; refer to the lxml documentation.
items = tree.xpath("*[local-name(.) = 'a_tag_goes_here']")
did the job. On top of that I had to browse the generated list items manually to define my other desired filtering functions.
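For comparison, the same idea (match on the local name and ignore the namespace) can be sketched with the standard library's ElementTree by stripping the "{namespace}" part of each tag. This is an alternative to the XPath call above, not the same API; local_name is a hypothetical helper and the sample document, including the urn:example:other namespace, is made up:

```python
import xml.etree.ElementTree as ET

DOC = """<report xmlns:trw="http://www.trw.com/20131231"
        xmlns:other="urn:example:other">
  <trw:row>1</trw:row>
  <other:row>2</other:row>
  <trw:column>3</trw:column>
</report>"""


def local_name(tag):
    """Return the tag without its '{namespace}' prefix, if it has one."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(DOC)
# Select every element called "row", whatever namespace it lives in.
rows = [el for el in root.iter() if local_name(el.tag) == "row"]
row_texts = [el.text for el in rows]
print(row_texts)
```

Note that this deliberately treats elements from different namespaces as the same thing, which is only safe when the local names are known not to collide.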

python lxml adds unused namespaces

I'm having an issue when using lxml's find() method to select a node in an xml file. Essentially I am trying to move a node from one xml file to another.
File 1:
<somexml xmlns:a='...' xmlns:b='...' xmlns:c='...'>
<somenode id='foo'>
<something>bar</something>
</somenode>
</somexml>
Once I parse File 1 and do a find on it:
node = tree.find('//*[@id="foo"]')
Node looks like this:
<somenode xmlns:a='...' xmlns:b='...' xmlns:c='...'>
<something>bar</something>
</somenode>
Notice it added the namespaces that were found in the document to that node. However, nothing in that node uses any of those namespaces. How would I go about either A) not writing namespaces that aren't used in the selected node, or B) removing unused name space declarations? If it's being used in the selected node then I will need it, but otherwise, I would like to get rid of them. Any ideas? Thanks!
If the namespaces are in the document, then the document uses the namespaces. The namespaces are being used in those nodes, because those nodes are part of the subtree which declared the namespace. Follow the link given by Daenyth to remove them, or strip them off the XML string before you turn it into an lxml object.
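One possible way to do option B with lxml itself is etree.cleanup_namespaces(), which removes namespace declarations that nothing in the given subtree uses. The sketch below is an assumption about how that could be applied here, not the method from the linked answer: it works on a deep copy so the original tree is left alone, and the urn:example namespaces stand in for the elided '...' values in the question:

```python
from copy import deepcopy

from lxml import etree

# Placeholder namespaces; the question elides the real ones.
SOURCE = b"""<somexml xmlns:a="urn:example:a" xmlns:b="urn:example:b">
  <somenode id="foo"><something>bar</something></somenode>
</somexml>"""

root = etree.fromstring(SOURCE)
node = root.find('.//*[@id="foo"]')

# Work on a detached copy so the source document is not modified.
detached = deepcopy(node)
detached.tail = None  # drop the whitespace that followed the node

# Remove namespace declarations that are not used inside the copy.
etree.cleanup_namespaces(detached)

result = etree.tostring(detached, encoding="unicode")
print(result)
```

Declarations that the copied subtree actually uses are kept, so this matches the "if it's being used then I will need it" requirement.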

Comments in XML at beginning of document

My Python XML parser fails if there's a comment at the beginning of an XML file, like:
<?xml version="1.0" encoding="utf-8"?>
<!-- Script version: "1"-->
<!-- Date: "07052010"-->
<component name="abc">
<pp>
....
</pp>
</component>
is it illegal to place a comment like this?
EDIT:
well it´s not throwing an error but the DOM module will fail and not recognize the child nodes:
import xml.dom.minidom as dom
sub_tree = dom.parse('xyz.xml')
for component in sub_tree.firstChild.childNodes:
    print(component)
I cannot access the child nodes; sub_tree.firstChild.childNodes returns an empty list, but if I remove those 2 comments I can loop through the list and read the child nodes as usual!
EDIT:
Guys, this simple example works and is enough to figure it out. Start your Python shell and execute the small code above. With the comments in place it will output nothing, and after deleting the comments it will show the node!
If you do this:
import xml.dom.minidom as dom
sub_tree = dom.parse('xyz.xml')
print(sub_tree.childNodes)
You will see what your problem is:
>>> print(sub_tree.childNodes)
[<DOM Comment node " Script ve...">, <DOM Comment node " Date: "07...">, <DOM Element: component at 0x7fecf88c>]
firstChild will obviously pick up the first child, which is a comment and doesn't have any children of its own.
You could iterate over the children and skip all comment nodes.
Or you could ditch the DOM model and use ElementTree, which is so much nicer to work with. :)
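A short sketch of the "skip the comment nodes" route with minidom, using an inline copy of the sample document (with a made-up body for <pp>) instead of xyz.xml:

```python
import xml.dom.minidom as dom

SOURCE = b"""<?xml version="1.0" encoding="utf-8"?>
<!-- Script version: "1"-->
<!-- Date: "07052010"-->
<component name="abc">
  <pp>text</pp>
</component>"""

document = dom.parseString(SOURCE)

# Pick the first top-level node that is an element, not a comment.
root = next(
    node for node in document.childNodes
    if node.nodeType == node.ELEMENT_NODE
)

# Inside the root, keep only element children (skips whitespace text nodes).
child_elements = [
    child.tagName for child in root.childNodes
    if child.nodeType == child.ELEMENT_NODE
]
print(root.tagName, child_elements)
```

minidom also exposes the root element directly as document.documentElement, which avoids the search over the top-level nodes.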
It is legal; from XML 1.0 Reference:
2.5 Comments
[Definition: Comments may appear anywhere in a document outside other markup; in addition, they may appear within the document type declaration at places allowed by the grammar. They are not part of the document's character data; an XML processor MAY, but need not, make it possible for an application to retrieve the text of comments. For compatibility, the string "--" (double-hyphen) MUST NOT occur within comments.] Parameter entity references MUST NOT be recognized within comments.
To get better answers, show us (a) a small complete Python script and (b) a small complete XML document that together demonstrate the unexpected behaviour.
Have you considered using ElementTree?
That should be legal as long as the XML declaration is on the first line.
