Python, OOXML, ElementTree and document root attributes

Python, OOXML, ElementTree and document root attributes - python

ElementTree (Python 2.7) does not see the attributes of the root element, for example, for tag <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"> - get an empty dictionary. I want "on the fly"to get the namespace for working with tags. Library xml.dom.minidom works fine, but I don't want to lose features with ET. Code example:
from xml.etree import ElementTree as ET
import zipfile
path = '/path/to/sample.docx'
zf = zipfile.ZipFile(path, 'r')
root = ET.fromstring(zf.read('word/document.xml'))
print(root.tag, root.attrib) # =>
# ('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}document', {})

An XML namespace declaration (a thing starting with xmlns:) is not an attribute. I think that's why you're not seeing it appear in the attrib dictionary. There are other ways of working with namespaces, so if you can say more about the purposes you're working to serve I may be able to be of more help.
The namespaces (and their prefixes) of WordprocessingML elements are well known and documented, and relatively few in number. There are some tens at most and only a small handful that appear in most documents. So depending on what you're trying to accomplish it may be easier to get done than it might seem.

Related

Parsing XML in python: selecting an attribute given that a child node has a specific attribute

Given the xml
xmlstr = '''
<myxml>
<Description id="10">
<child info="myurl"/>
</Description>
</myxml>'
I'd like to get the id of Description only where child has an attribute of info.
import xml.etree.ElementTree as ET
root = ET.fromstring(xmlstr)
a = root.find(".//Description/[child/#info]")
print(a.attrib)
and changing the find to .//Description/[child[#info]]
both return an error of:
SyntaxError: invalid predicate
I know that etree only supports a subset of xpath, but this doesn't seem particularly weird - should this work? If so, what have I done wrong?!
Changing the find to .//Description/[child] does work, and returns
{'id': '10'}
as expected

You've definitely hit that XPath limited support limitation as, if we look at the source directly (looking at 3.7 source code), we could see that while parsing the Element Path expression, only these things in the filters are considered:
[#attribute] predicate
[#attribute='value']
[tag]
[.='value'] or [tag='value']
[index] or [last()] or [last()-index]
Which means that both of your rather simple expressions are not supported.
If you really want/need to stick with the built-in ElementTree library, one way to solve this would be with finding all Description tags via .findall() and filtering the one having a child element with info attribute.

You can also get those values as keys, which makes it a bit more structured approach to gather data:
import xml.etree.ElementTree as ET
root = ET.fromstring(xmlstr)
wht =root.find(".//Description")
wht.keys() #--> ['id']
wht.get('id') # --> '10'

Parsing Google Earth KML file in Python (lxml, namespaces)

I am trying to parse a .kml file into Python using the xml module (after failing to make this work in BeautifulSoup, which I use for HTML).
As this is my first time doing this, I followed the official tutorial and all goes well until I try to construct an iterator to extract my data by root iteration:
from lxml import etree
tree=etree.parse('kmlfile')
Here is the example from the tutorial I am trying to emulate:
If you know you are only interested in a single tag, you can pass its name to getiterator() to have it filter for you:
for element in root.getiterator("child"):
print element.tag, '-', element.text
I would like to get all data under 'Placemark', so I tried
for i in tree.getiterterator("Placemark"):
print i, type(i)
which doesn't give me anything. What does work is:
for i in tree.getiterterator("{http://www.opengis.net/kml/2.2}Placemark"):
print i, type(i)
I don't understand how this comes about. The www.opengis.net is listed in the tag at the beginning of the document (kml xmlns="http://www.opengis.net/kml/2.2"...) , but I don't understand
how the part in {} relates to my specific example at all
why it is different from the tutorial
and what I am doing wrong
Any help is much appreciated!

Here is my solution.
So, the most important thing to do is read this as posted by Tomalak. It's a really good description of namespaces and easy to understand.
We are going to use XPath to navigate the XML document. Its notation is similar to file systems, where parents and descendants are separated by slashes /. The syntax is explained here, but note that some commands are different for the lxml implementation.
###Problem
Our goal is to extract the city name: the content of <name> which is under <Placemark>. Here's the relevant XML:
<Placemark> <name>CITY NAME</name>
The XPath equivalent to the non-functional code I posted above is:
tree=etree.parse('kml document')
result=tree.xpath('//Placemark/name/text()')
Where the text() part is needed to get the text contained in the location //Placemark/name.
Now this doesn't work, as Tomalak pointed out, cause the name of these two nodes are actually {http://www.opengis.net/kml/2.2}Placemark and {http://www.opengis.net/kml/2.2}name. The part in curly brackets is the default namespace. It does not show up in the actual document (which confused me) but it is defined at the beginning of the XML document like this:
xmlns="http://www.opengis.net/kml/2.2"
###Solution
We can supply namespaces to xpath by setting the namespaces argument:
xpath(X, namespaces={prefix: namespace})
This is easy enough for the namespaces that have actual prefixes, in this document for instance <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode> where the gx prefix is defined in the document as xmlns:gx="http://www.google.com/kml/ext/2.2".
However, Xpath does not understand what a default namespace is (cf docs). Therefore, we need to trick it, like Tomalak suggested above: We invent a prefix for the default and add it to our search terms. We can just call it kml for instance. This piece of code actually does the trick:
tree.xpath('//kml:Placemark/kml:name/text()', namespaces={"kml":"http://www.opengis.net/kml/2.2"})
The tutorial mentions that there is also an ETXPath method, that works just like Xpath except that one writes the namespaces out in curly brackets instead of defining them in a dictionary. Thus, the input would be of the style {http://www.opengis.net/kml/2.2}Placemark.

python etree with xpath and namespaces with prefix

I can't find info, how to parse my XML with namespace:
I have this xml:
<par:Request xmlns:par="http://somewhere.net/actual">
<par:actual>blabla</par:actual>
<par:documentType>string</par:documentType>
</par:Request>
And tried to parse it:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
for subtag in rootxml.xpath(u'//par:actual'):
#do something
print(subtag)
And got exception, because it doesn't know about namespace prefix.
Is there best way to solve that problem, counting that script will not know about file it going to parse and tag is going to search for?
Searching web and stackoverflow I found, that if I will add there:
namespace = {u'par': u"http://somewhere.net/actual"}
for subtag in rootxml.xpath(u'//par:actual', namespaces=namespace):
#do something
print(subtag)
That works. Perfect. But I don't know which XML I will parse, and searching tag (such as //par:actual) is also unknown to my script. So, I need to find way to extract namespace from XML somehow.
I found a lot of ways, how to extract namespace URI, such as:
print(rootxml.tag)
print(rootxml.xpath('namespace-uri(.)'))
print(rootxml.xpath('namespace-uri(/*)'))
But how should I extract prefix to create dictionary which ElementTree wants from me? I don't want to use regular expression monster over xml body to extract prefix, I believe there have to exist supported way for that, isn't it?
And maybe there have to exist some methods for me to extract by ETree namespace from XML as dictionary (as ETree wants!) without hands manipulation?

You cannot rely on the namespace declarations on the root element: there is no guarantee that the declarations will even be there, or that the document will have the same prefix for the same namespace throughout.
Assuming you are going to have some way of passing the tag you want to search (because you say it is not known by your script), you should also provide a way to pass a namespace mapping as well. Or use the James Clark notation, like {http://somewhere.net/actual}actual (the ETXPath has support for this syntax, whereas "normal" xpath does not, but you can also use other methods like .findall() if you don't need full xpath)
If you don't care for the prefix at all, you could also use the local-name() function in xpath, eg. //*[local-name()="actual"] (but you won't be "really" sure it's the right "actual")

Oh, I found it.
After we do that:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
Object rootxml contains dictionary nsmap, which contains all namespaces that I want.
So, simplest solution I've found:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
nss = rootxml.nsmap
for subtag in rootxml.xpath(u'//par:actual', namespaces=nss):
#do something
print(subtag)
That works.
UPD: that works if user understand what means 'par' in XML he works with. For example, comparing supposed namespace with existing namespace before any other operations.
Still, I like much variant with XPath that understands {...}actual, that was what I tried to achieve.

With Python 3.8.2 I found this question with the same issue.
This is the solution I found, put the namespace in the XPath query. (Between the {})
ApplicationArea = BOD_IN_tree.find('.//ApplicationArea', ns)
if(ApplicationArea is None):
ApplicationArea = BOD_IN_tree.find('.//{http://www.defaultNamespace.com/2}ApplicationArea', ns)
I search for the element without the namespace, then search again if it's not found. I have no control over the inbound documents, some have namespaces, some do not.
I hope this helps!

Parse xbrl file in python

I am working on a xml parser.
The goal is to parse a number of different xml files where prefixes and tags remain consistent but namespaces change.
I am hence trying either:
to parse the xml just by <prefix:tags> without resolving (replacing) the prefix with the namespace. The prefixes remain unchanged from document to document.
to load automatically the namespaces so that the identifier (<prefix:tag>) could be replaced with the proper namespace.
just parse the xml by tag
I have tried with xml.etree.ElementTree.
I also had a look at lxml
I did not find any configuration option of the XMLParser in lxml that could help me out although here I could read an answer where the author suggests that lxml should be able to collect namespaces for me automatically.
Interestingly, parsed_file = etree.XML(file) fails with the error:
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
One example of the files I would like to parse is here

Do not care about ns prefixes, care about complete namespaces
Sometime, people do care about those short prefixes and forgetting, the are of secondary importance. They are only short reference to fully qualified namespace. E.g.
xmlns:trw="http://www.trw.com/20131231"
in xml means, from now on, the "trw:" stands for fully qualified namespace "http://www.trw.com/20131231". Note, that this prefix can be redefined to any other namespace in any following element and may get completely different meaning.
On the other hand, when you care about real meaning, what means here fully qualified namespace, you may think of "trw:row" as "{http://www.trw.com/20131231}row". This translated meaning will be reliable and will not change with prefix changes.
Parsing referred xml
The link to http://edgar.sec.gov/Archives/edgar/data/1267097/000104746914000925/trw-20131231.xml leads to an xml, which validates by xmlstarlet and which lxml is able to parse.
The error message you show is referring to very first character of the stream, so chances are you either met BOM byte in your file, or you are trying to read xml, which is gzipped and shall be decompressed first.
lxml and namespaces
lxml works with namespaces well. It allows you to use XPath expressions, which use namespaces. With controlling namspace prefix on output it is a bit more complex, as it is dependent on xmlns attributes, which are part of serialized document. If you want to modify the prefixes, you must somehow organize these xmlns attributes, often by moving all of the to the root element. At the same time, lxml keeps track of fully qualified namespace of each element, so at the moment of serialization, it will respect this full name as well as currently valid prefix for this namespace.
Handling these xmlna attributes is a bit of more code, refer to lxml documentation.

items = tree.xpath("*[local-name(.) = 'a_tag_goes_here']")
did the job. On top of that I had to browse the generated list items manually to define my other desired filtering functions.

Python etree control empty tag format

When creating an XML file with Python's etree, if we write to the file an empty tag using SubElement, I get:
<MyTag />
Unfortunately, our XML parser library used in Fortran doesn't handle this even though it's a correct tag. It needs to see:
<MyTag></MyTag>
Is there a way to change the formatting rules or something in etree to make this work?

As of Python 3.4, you can use the short_empty_elements argument for both the tostring() function and the ElementTRee.write() method:
>>> from xml.etree import ElementTree as ET
>>> ET.tostring(ET.fromstring('<mytag/>'), short_empty_elements=False)
b'<mytag></mytag>'
In older Python versions, (2.7 through to 3.3), as a work-around you can use the html method to write out the document:
>>> from xml.etree import ElementTree as ET
>>> ET.tostring(ET.fromstring('<mytag/>'), method='html')
'<mytag></mytag>'
Both the ElementTree.write() method and the tostring() function support the method keyword argument.
On even earlier versions of Python (2.6 and before) you can install the external ElementTree library; version 1.3 supports that keyword.
Yes, it sounds a little weird, but the html output mostly outputs empty elements as a start and end tag. Some elements still end up as empty tag elements; specifically <link/>, <input/>, <br/> and such. Still, it's that or upgrade your Fortran XML parser to actually parse standards-compliant XML!

This was directly solved in Python 3.4. From then, the write method of xml.etree.ElementTree.ElementTree has the short_empty_elements parameter which:
controls the formatting of elements that contain no content. If True (the default), they are emitted as a single self-closed tag, otherwise they are emitted as a pair of start/end tags.
More details in the xml.etree documentation.

Adding an empty text is another option:
etree.SubElement(parent, 'child_tag_name').text=''
But note that this will change not only the representation but also the structure of the document: i.e. child_el.text will be '' instead of None.
Oh, and like Martijn said, try to use better libraries.

If you have sed available, you could pipe the output of your python script to
sed -e "s/<\([^>]*\) \/>/<\1><\/\1>/g"
Which will find any occurence of <Tag /> and replace it by <Tag></Tag>

Paraphrasing the code, the version of ElementTree.py I use contains the following in a _write method:
write('<' + tagname)
...
if node.text or len(node): # this line is literal
write('>')
...
write('</%s>' % tagname)
else:
write(' />')
To steer the program counter I created the following:
class AlwaysTrueString(str):
def __nonzero__(self): return True
true_empty_string = AlwaysTrueString()
Then I set node.text = true_empty_string on those ElementTree nodes where I want an open-close tag rather than a self-closing one.
By "steering the program counter" I mean constructing a set of inputs—in this case an object with a somewhat curious truth test—to a library method such that the invocation of the library method traverses its control flow graph the way I want it to. This is ridiculously brittle: in a new version of the library, my hack might break—and you should probably treat "might" as "almost guaranteed". In general, don't break abstraction barriers. It just worked for me here.

If you have python >=3.4, use the short_empty_elements=Falseoption as has been shown in other answers already, but:
If you have the XML in string form already and can't touch the code
where it's generated..
If you're in a situation where you are stuck with python <3.4..
If you're using a different XML library that insists on self-closing tags..
Then this works:
xml = "<foo/><bar/>"
xml = re.sub(r'<([^\/]+)\/\>', r'<\1></\1>', xml)
print(xml)
# output will be
# <foo></foo><bar></bar>

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python, OOXML, ElementTree and document root attributes - python

Related

Parsing XML in python: selecting an attribute given that a child node has a specific attribute

Parsing Google Earth KML file in Python (lxml, namespaces)

python etree with xpath and namespaces with prefix

Parse xbrl file in python

Python etree control empty tag format

Categories

Resources