Retaining empty elements when parsing with ElementTree - python

Using Python 3.4 and ElementTree, I'm trying to add a sub-element to an xml file, keeping the xml file (written in UTF-16) otherwise exactly the same.
My code:
import xml.etree.ElementTree as ET
new = 'new_XML_file.xml'  # placeholder for the real path to the XML file
tree = ET.parse(new)
root = tree.getroot()
new_element = ET.SubElement(root, 'RENAMED_SOUND_FILE')
new_element.text = new.split('\\')[num][:-4] + '.wav'
tree.write(fake_path + new.split('\\')[num], encoding='utf-16', xml_declaration=True)
The problem I'm having is that empty elements are being changed in this process. For example:
<EMPTY_ELEMENT></EMPTY_ELEMENT>
becomes:
<EMPTY_ELEMENT />
I know that to a machine, this is basically the same thing, but I'd like to retain the earlier formatting for testing purposes.
Any ideas on how I can retain the full empty elements?

Per the documentation, output methods (whether you're using tostring methods or write) have a "short_empty_elements" keyword that defaults to True. Making this False should give you your desired output:
import xml.etree.ElementTree as ET
root = ET.Element("root")
print(ET.tostring(root, short_empty_elements=False))
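Applied to the situation in the question, the same keyword can be passed to write(). The following is a minimal sketch, assuming Python 3.4 or later; the sample XML string, the 'example.wav' text and the in-memory buffer are placeholders standing in for the real file and paths:

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for the parsed input file (hypothetical content).
source = '<root><EMPTY_ELEMENT></EMPTY_ELEMENT></root>'
tree = ET.ElementTree(ET.fromstring(source))
root = tree.getroot()

# Add the new sub-element, as in the question.
new_element = ET.SubElement(root, 'RENAMED_SOUND_FILE')
new_element.text = 'example.wav'

# Write to an in-memory buffer instead of a real path for illustration.
buffer = io.BytesIO()
tree.write(buffer, encoding='utf-16', xml_declaration=True,
           short_empty_elements=False)

output = buffer.getvalue().decode('utf-16')
print(output)
```

With short_empty_elements=False, the empty element is written as a start/end pair rather than a self-closed tag.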

Related

Parsing XML in python: selecting an attribute given that a child node has a specific attribute

Given the xml
xmlstr = '''
<myxml>
<Description id="10">
<child info="myurl"/>
</Description>
</myxml>'''
I'd like to get the id of Description only where child has an attribute of info.
import xml.etree.ElementTree as ET
root = ET.fromstring(xmlstr)
a = root.find(".//Description/[child/@info]")
print(a.attrib)
and changing the find to .//Description/[child[@info]]
both return an error of:
SyntaxError: invalid predicate
I know that etree only supports a subset of xpath, but this doesn't seem particularly weird - should this work? If so, what have I done wrong?!
Changing the find to .//Description/[child] does work, and returns
{'id': '10'}
as expected
You've hit the limits of ElementTree's XPath support. Looking at the source directly (the 3.7 source code), the parser for element path expressions only accepts these forms in predicates:
[@attribute] predicate
[@attribute='value']
[tag]
[.='value'] or [tag='value']
[index] or [last()] or [last()-index]
Which means that both of your rather simple expressions are not supported.
If you want or need to stick with the built-in ElementTree library, one way to solve this is to find all Description tags via .findall() and keep those that have a child element with an info attribute.
You can also read an element's attributes through keys() and get(), which is a slightly more structured way to gather the data:
import xml.etree.ElementTree as ET
root = ET.fromstring(xmlstr)
wht = root.find(".//Description")
wht.keys()     # --> ['id']
wht.get('id')  # --> '10'
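The findall-and-filter approach mentioned above could look roughly like this. It is only a sketch: the second Description element is added here for illustration so that the filter has something to exclude, and the inner lookup relies on the supported [@attribute] predicate.

```python
import xml.etree.ElementTree as ET

# The second Description (id="11") is a hypothetical addition for illustration.
xmlstr = '''
<myxml>
<Description id="10">
<child info="myurl"/>
</Description>
<Description id="11">
<child/>
</Description>
</myxml>'''

root = ET.fromstring(xmlstr)

# Keep only Description elements that have a direct child carrying "info".
# Compare against None explicitly: an element without children is falsy.
matches = [
    desc for desc in root.findall(".//Description")
    if desc.find("child[@info]") is not None
]
ids = [desc.get("id") for desc in matches]
print(ids)
```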

Python, OOXML, ElementTree and document root attributes

ElementTree (Python 2.7) does not see the attributes of the root element. For example, for the tag <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"> I get an empty dictionary. I want to get the namespace "on the fly" for working with tags. The xml.dom.minidom library works fine, but I don't want to lose the features of ET. Code example:
from xml.etree import ElementTree as ET
import zipfile
path = '/path/to/sample.docx'
zf = zipfile.ZipFile(path, 'r')
root = ET.fromstring(zf.read('word/document.xml'))
print(root.tag, root.attrib) # =>
# ('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}document', {})
An XML namespace declaration (a thing starting with xmlns:) is not an attribute. I think that's why you're not seeing it appear in the attrib dictionary. There are other ways of working with namespaces, so if you can say more about the purposes you're working to serve I may be able to be of more help.
The namespaces (and their prefixes) of WordprocessingML elements are well known and documented, and relatively few in number. There are some tens at most and only a small handful that appear in most documents. So depending on what you're trying to accomplish it may be easier to get done than it might seem.
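If the goal is to collect the prefix-to-URI mapping while parsing, one possible route is iterparse() with the "start-ns" event, which reports each namespace declaration as a (prefix, uri) pair. The sketch below uses a small inline byte string as a stand-in for the contents of word/document.xml, which is an assumption made for illustration:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for zf.read('word/document.xml').
xml_bytes = (
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/'
    b'wordprocessingml/2006/main"><w:body/></w:document>'
)

# Each "start-ns" event carries a (prefix, uri) tuple.
namespaces = {}
for event, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes),
                                         events=("start-ns",)):
    namespaces[prefix] = uri

# The collected mapping can then be used for prefixed lookups.
root = ET.fromstring(xml_bytes)
body = root.find("w:body", namespaces)
print(namespaces, body.tag)
```

This reads the document twice for simplicity; for large files it may be preferable to request both "start-ns" and "start" events in a single pass.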

How to stop Python ElementTree from doing <element /> instead of <element></element>? [duplicate]

When creating an XML file with Python's etree, if we write to the file an empty tag using SubElement, I get:
<MyTag />
Unfortunately, our XML parser library used in Fortran doesn't handle this even though it's a correct tag. It needs to see:
<MyTag></MyTag>
Is there a way to change the formatting rules or something in etree to make this work?
As of Python 3.4, you can use the short_empty_elements argument for both the tostring() function and the ElementTree.write() method:
>>> from xml.etree import ElementTree as ET
>>> ET.tostring(ET.fromstring('<mytag/>'), short_empty_elements=False)
b'<mytag></mytag>'
In older Python versions, (2.7 through to 3.3), as a work-around you can use the html method to write out the document:
>>> from xml.etree import ElementTree as ET
>>> ET.tostring(ET.fromstring('<mytag/>'), method='html')
'<mytag></mytag>'
Both the ElementTree.write() method and the tostring() function support the method keyword argument.
On even earlier versions of Python (2.6 and before) you can install the external ElementTree library; version 1.3 supports that keyword.
Yes, it sounds a little weird, but the html output mostly outputs empty elements as a start and end tag. Some elements still end up as empty tag elements; specifically <link/>, <input/>, <br/> and such. Still, it's that or upgrade your Fortran XML parser to actually parse standards-compliant XML!
This was directly solved in Python 3.4. From then, the write method of xml.etree.ElementTree.ElementTree has the short_empty_elements parameter which:
controls the formatting of elements that contain no content. If True (the default), they are emitted as a single self-closed tag, otherwise they are emitted as a pair of start/end tags.
More details in the xml.etree documentation.
Adding an empty text is another option:
etree.SubElement(parent, 'child_tag_name').text=''
But note that this will change not only the representation but also the structure of the document: i.e. child_el.text will be '' instead of None.
Oh, and like Martijn said, try to use better libraries.
If you have sed available, you could pipe the output of your python script to
sed -e "s/<\([^>]*\) \/>/<\1><\/\1>/g"
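Which will find any occurrence of <Tag /> and replace it by <Tag></Tag>. This assumes the tags carry no attributes; with attributes, the captured group would be repeated inside the closing tag.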
Paraphrasing the code, the version of ElementTree.py I use contains the following in a _write method:
write('<' + tagname)
...
if node.text or len(node): # this line is literal
    write('>')
    ...
    write('</%s>' % tagname)
else:
    write(' />')
To steer the program counter I created the following:
class AlwaysTrueString(str):
    def __nonzero__(self): return True
true_empty_string = AlwaysTrueString()
Then I set node.text = true_empty_string on those ElementTree nodes where I want an open-close tag rather than a self-closing one.
By "steering the program counter" I mean constructing a set of inputs—in this case an object with a somewhat curious truth test—to a library method such that the invocation of the library method traverses its control flow graph the way I want it to. This is ridiculously brittle: in a new version of the library, my hack might break—and you should probably treat "might" as "almost guaranteed". In general, don't break abstraction barriers. It just worked for me here.
If you have Python >=3.4, use the short_empty_elements=False option as has been shown in other answers already, but:
If you have the XML in string form already and can't touch the code where it's generated...
If you're in a situation where you are stuck with Python <3.4...
If you're using a different XML library that insists on self-closing tags...
Then this works:
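import re
xml = "<foo/><bar/>"
# Match only the tag name, so the pattern cannot run across a preceding '>'.
# Self-closing tags that carry attributes are left unchanged.
xml = re.sub(r'<([^\s/>]+)\s*/>', r'<\1></\1>', xml)
print(xml)
# output will be
# <foo></foo><bar></bar>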

Python and ElementTree: write() isn't working properly

First question. If I screwed up somehow let me know.
Ok, what I need to do is the following. I'm trying to use Python to get some data from an API. The API sends it to me in XML. I'm trying to use ElementTree to parse it.
Now every time I request information from the API, it's different. I want to construct a list of all the data I get. I could use Python's lists, but since I want to save it to a file at the end I figured - why not use ElementTree for that too.
Start with an Element, let's call it ListE. Call the API, parse the XML, get the root Element from the ElementTree. Add the root Element as a subelement into ListE. Call the API again, and do it all over. At the end ListE should be an Element whose subelements are the results of each API call. At the end of everything, just wrap ListE into an ElementTree in order to use the ElementTree write() function. Below is the code.
import xml.etree.ElementTree as ET
from urllib2 import urlopen
url = "http://api.intrade.com/jsp/XML/MarketData/ContractBookXML.jsp?id=769355"
try:
    returnurl = urlopen(url)
except IOError:
    exit()
tree = ET.parse(returnurl)
root = tree.getroot()
print "root tag and attrib: ",root.tag, root.attrib
historyE = ET.Element('historical data')
historyE.append(root)
historyE.append(root)
historyET = ET.ElementTree(historyE)
historyET.write('output.xml',"UTF-8")
The program doesn't return any error. The problem is when I ask the browser to open it, it claims a syntax error. Opening the file with notepad here's what I find:
<?xml version='1.0' encoding='UTF-8'?>
<historical data><ContractBookInfo lastUpdateTime="0">
<contractInfo conID="769355" expiryPrice="100.0" expiryTime="1357334563000" state="S" vol="712" />
</ContractBookInfo><ContractBookInfo lastUpdateTime="0">
<contractInfo conID="769355" expiryPrice="100.0" expiryTime="1357334563000" state="S" vol="712" />
</ContractBookInfo></historical data>
I think the reason for the syntax error is that there isn't a space or a return between 'historical data' and 'ContractBookInfo lastUpdateTime="0"'. Suggestions?
The problem is here:
historyE = ET.Element('historical data')
You shouldn't use a space. As summarized on Wikipedia:
The element tags are case-sensitive; the beginning and end tags must
match exactly. Tag names cannot contain any of the characters
!"#$%&'()*+,/;<=>?@[]^`{|}~, nor a space character, and cannot start
with -, ., or a numeric digit.
See this section of the XML spec for the details ("Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters.")
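As a rough illustration of the fix, the sketch below uses an underscore in place of the space and then re-parses the written output to confirm it is well-formed. The inline XML string stands in for the API response, and the name historical_data is only one possible replacement; both are assumptions made for this example:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one response from the API.
api_response = (
    '<ContractBookInfo lastUpdateTime="0">'
    '<contractInfo conID="769355" state="S" vol="712" />'
    '</ContractBookInfo>'
)

# A valid element name: no space character.
history = ET.Element('historical_data')
for _ in range(2):
    # Parse each response separately and append its root element.
    history.append(ET.fromstring(api_response))

# Write to an in-memory buffer instead of output.xml for illustration.
buffer = io.BytesIO()
ET.ElementTree(history).write(buffer, encoding='UTF-8', xml_declaration=True)

# Re-parsing succeeds only if the output is well-formed XML.
reparsed = ET.fromstring(buffer.getvalue())
print(reparsed.tag, len(reparsed))
```

ElementTree does not validate tag names when an Element is created, which is why the original program ran without errors and the problem only showed up when the file was read back.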

