Accessing Nested Tag Item with getElementsByTagName - python

If the same tag name is used in multiple places within an XML file, with the nesting providing uniqueness, what is the best way to specify the particular node of interest?
from xml.dom.minidom import parse
dom = parse("inputs.xml")
data_node = dom.getElementsByTagName("outer_level_x")[0].getElementsByTagName('inner_level_y')[0].getElementsByTagName('Data')
So, is there a better way to specify the "Data" node nested under "<outer_level_x><inner_level_y>"? The specific nesting is always known and a function which recurses calling getElementsByTagName could be written; but, I suspect that I am missing something basic here.

xml.etree.ElementTree supports a subset of XPath syntax in find() and findall(), which lets you specify the desired tags and attributes precisely.
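A minimal sketch of that approach. The element names come from the question; the wrapping `root` element and the text values are assumptions made up for illustration. ElementTree implements only a limited subset of XPath, but a simple child path like this one is within it:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for inputs.xml: "Data" appears in three places,
# and only the nesting tells them apart.
xml_text = """<root>
  <outer_level_x>
    <inner_level_y><Data>wanted</Data></inner_level_y>
    <other_level><Data>not wanted</Data></other_level>
  </outer_level_x>
  <outer_level_z>
    <inner_level_y><Data>also not wanted</Data></inner_level_y>
  </outer_level_z>
</root>"""

root = ET.fromstring(xml_text)

# The path spells out the nesting, so only the Data element under
# outer_level_x/inner_level_y is selected.
data_nodes = root.findall("./outer_level_x/inner_level_y/Data")
texts = [node.text for node in data_nodes]
print(texts)
```

With a file on disk, `ET.parse("inputs.xml").getroot()` would replace the `ET.fromstring` call.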

Related

Finding A Child Element If It Exists When Manipulating PowerPoint XML With python-pptx

In md2pptx - which uses python-pptx to turn Markdown into PowerPoint - I've implemented a few functions that manipulate the XML tree.
In a few places I need to find a child element if it exists - and create it if it doesn't.
I have a rather hacky way of searching for this element. I'd rather have a decent way.
So, could someone post me the "right" way to search for a child element's existence.
There's probably a more general version of this question - how to manipulate XML in the context of python-pptx. I could use a reference for that, too. (Yes, I can read the python-pptx code and often do - but a synopsis would help me get it right.)
Using XPath for this job is almost always the right answer.
For example, if you wanted to get all the a:fld child elements of a paragraph to implement something to do with text fields:
# --- get <a:p> XML element of paragraph ---
p = paragraph._p
# --- use XPath to get all the `<a:fld>` child elements ---
flds = p.xpath("./a:fld")
# --- do something with them ---
for fld in flds:
    do_fieldy_thing(fld)
The result of an .xpath() call is a list of the zero-or-more items that matched the str XPath expression provided as its argument. If there can only be zero or one result it's common to process it like this instead:
if flds:
    do_fieldy_thing(flds[0])
The complication arises when the "starting" element (p in this case) is not a defined oxml element. oxml is a layer of custom element classes added by python-pptx "on top of" the base lxml.etree._Element class for each XML element. These custom element classes provide some convenience services, in particular allowing you to specify elements using their namespace prefixes (like "a:fld" in this case).
Not all elements in python-pptx have a custom element class, only those we manipulate via the API in some way. Any element you get from a python-pptx object (like paragraph._p above) will be an oxml element, but the elements returned by the .xpath() calls very likely won't be (otherwise you would have used python-pptx to get them). Elements that are not oxml elements are plain lxml.etree._Element instances.
The .xpath() implementation on an lxml.etree._Element instance requires use of so-called "Clark names" which look something like: "{http://schemas.openxmlformats.org/drawingml/2006/main}fld" instead of "a:fld".
You can create a Clark-name from a namespace-prefixed tag name using the pptx.oxml.ns.qn() function:
>>> from pptx.oxml.ns import qn
>>> qn("a:fld")
'{http://schemas.openxmlformats.org/drawingml/2006/main}fld'
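The "find the child if it exists, otherwise create it" pattern from the question can be sketched with Clark names. To stay self-contained, this sketch uses the standard library's xml.etree.ElementTree rather than python-pptx; the local `qn` and `get_or_add_child` functions are hypothetical helpers written for this example, not part of python-pptx, and `a:endParaRPr` is chosen only because it is the last child of `a:p`, so appending keeps the schema order:

```python
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def qn(tag):
    """Hypothetical stand-in for pptx.oxml.ns.qn; handles only the 'a' prefix."""
    prefix, local = tag.split(":")
    if prefix != "a":
        raise ValueError("only the 'a' prefix is known to this sketch")
    return "{%s}%s" % (A_NS, local)


def get_or_add_child(parent, tag):
    """Return the first direct child named `tag`, appending one if absent."""
    child = parent.find(qn(tag))
    if child is None:  # compare with None; do not rely on truthiness
        child = ET.SubElement(parent, qn(tag))
    return child


p = ET.fromstring('<a:p xmlns:a="%s"><a:r/></a:p>' % A_NS)
first = get_or_add_child(p, "a:endParaRPr")   # created on this call
second = get_or_add_child(p, "a:endParaRPr")  # found on this call
print(first is second)
```

The `is None` test matters: an element with no children is falsy, so `if not child` would wrongly treat an existing empty element as missing. For children that must sit at a specific position in the schema order, appending is not enough and the insert position has to be chosen explicitly.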

Find all elements in ElementTree by attribute using Python

I have an xml, which has a lot of different nodes with the different tags, but the same attribute. Is it possible to find all these nodes?
I know, that it is possible to find all nodes by attribute, if they all have the same tag:
root.findall(".//tag[@attrib]")
but in my case they all have different tags. Something like this is not working:
root.findall(".//[@attrib]")
In XPath you can use * to reference an element of any name, and you can use @* to reference an attribute of any name:
root.findall(".//*[@attrib]")
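A small check of that expression, using the standard library's xml.etree.ElementTree and a made-up document (the tag names `a` to `d` and the attribute values are assumptions for illustration):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root>"
    "<a attrib='1'/>"
    "<b><c attrib='2'/><d other='3'/></b>"
    "</root>"
)

# Any descendant element, whatever its tag, that carries the attribute.
with_attrib = root.findall(".//*[@attrib]")
tags = [el.tag for el in with_attrib]

# The same predicate can also test the attribute's value.
only_two = [el.tag for el in root.findall(".//*[@attrib='2']")]
print(tags, only_two)
```

Note that `.//*` selects descendants only, so the element you call findall() on is never part of the result.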
Side notes:
As a heads-up, if you're really using lxml (and didn't just tag the question with lxml by accident), I would suggest using the xpath() method instead of findall(). The former has much better XPath support. For example, when you need to find elements from a limited set of names, say foo and bar, you can use the following XPath expression with the xpath() method:
root.xpath("//*[self::foo or self::bar][@attrib]")
The same expression, when passed to findall(), results in an error:
SyntaxError: prefix 'self' not found in prefix map

Python: Parsing XML autoadd all key/value pairs

I searched for a long time and have tried a lot, but I can't get my head around this seemingly easy scenario. I should say that I'm a Python newbie but a very good bash coder ;o) I have written some Python code, but there is probably a lot I still need to learn, so please don't be too harsh with me ;o) I'm willing to learn, and I have read the Python docs and many examples and tried a lot on my own, but now I'm at a point where I'm poking around in the dark..
I parse content provided as XML. It is about 20-50 MB in size.
My XML Example:
<MAIN>
    <NOSUBEL>abcd</NOSUBEL>
    <NOSUBEL2>adasdasa</NOSUBEL2>
    <MULTISUB>
        <WHATEVER>
            <ANOTHERSUBEL>
                <ANOTHERONE>
                    (how many levels can not be said / can change)
                </ANOTHERONE>
            </ANOTHERSUBEL>
        </WHATEVER>
    </MULTISUB>..
    <SUBEL2>
        <FOO>abcdefg</FOO>
    </SUBEL2>
    <NOSUBEL3>abc</NOSUBEL3>
    ...
    and so on
</MAIN>
This is the main part of parsing it (if you need more details pls ask):
from lxml import etree
resp = my.request(some call args)
xml = etree.XML(resp)
for element in xml.findall(".//MAIN"):
    # this works fine but is not generic enough:
    my_dict = OrderedDict()
    for only1sub in element.iter(tag="SUBEL2"):
        for i in only1sub:
            my_dict[i.tag] = i.text
This works fine with one subelement, but it means I need to know which elements in the tree have subelements and which do not. That could change in the future, or new ones could be added.
Another problem is MULTISUB. With the above code I'm able to parse until the first tag only.
The goal
What I WANT to achieve is - at best:
A) Having one function / code snippet which is able to parse the whole XML content and if there is a subelement (e.g. with "if len(x)" or whatever) then parse to the next level until you reach a level without a subelement/tree. Then go on to B)
B) For each XML tag found which has NO subelements I want to update the dictionary with the tag name and the tag text.
C) I want to do that for all available elements - the tag and the direct child tag names (e.g. "NOSUBEL2" or "MULTISUB") will not change (often) so it will be ok to use them as a start point for parsing.
What I have tried so far was to chain several loops (for, then while, then for again, and so on), but nothing was fully successful. I also tried my hand at Python generators because I thought I could do something with the next() function, but that got me nowhere either. Then again, I may not have the knowledge to use them correctly, so I'm happy for every answer..
In the end, what I need is quite simple, I believe. I only want key/value pairs of the tag name and the tag content. That can't be so hard, can it? Any help is greatly appreciated..
Can you help me reach the goal?
(Already a thanks for reading until here!)
What you are looking for is recursion: a technique of running some procedure inside that same procedure, but for a sub-problem of the original problem. In this case: either run the procedure for each subelement of an element (if there are subelements), or update your dictionary with the element's tag name and text.
I assume that in the end you're interested in having a dictionary (OrderedDict) containing a "flat representation" of the tag names and text values of the whole element tree's leaves (nodes without subelements), which in your case, printed out, would look like this:
OrderedDict([('NOSUBEL', 'abcd'), ('NOSUBEL2', 'adasdasa'), ('ANOTHERONE', '(how many levels can not be said / can change)'), ('FOO', 'abcdefg'), ('NOSUBEL3', 'abc')])
Generally, you would define a function that will either call itself with part of your data (in this case: subelements, if there are any) or do something (in this case: update some instance of dictionary).
Since I don't know the details behind the my.request call, I've replaced it by parsing from a string containing valid XML, based on the one you provided. Just replace the construction of the tree object with your own call.
resp = """<MAIN>
<NOSUBEL>abcd</NOSUBEL>
<NOSUBEL2>adasdasa</NOSUBEL2>
<MULTISUB>
<WHATEVER>
<ANOTHERSUBEL>
<ANOTHERONE>(how many levels can not be said / can change)</ANOTHERONE>
</ANOTHERSUBEL>
</WHATEVER>
</MULTISUB>
<SUBEL2>
<FOO>abcdefg</FOO>
</SUBEL2>
<NOSUBEL3>abc</NOSUBEL3>
</MAIN>"""
from collections import OrderedDict
from lxml import etree
def update_dict(element, my_dict):
    # lxml defines "length" of the element as number of its children.
    if len(element):  # If "length" is other than 0.
        for subelement in element:
            # That's where the recursion happens. We're calling the same
            # function for a subelement of the element.
            update_dict(subelement, my_dict)
    else:  # Otherwise, the element is a leaf.
        my_dict[element.tag] = element.text
if __name__ == "__main__":
    # Change/amend it with your my.request call.
    tree = etree.XML(resp)  # That's a <MAIN> element, too.
    my_dict = OrderedDict()
    # That's the first invocation of the procedure. We're passing entire
    # tree and instance of dictionary.
    update_dict(tree, my_dict)
    print(my_dict)  # Just to see that dictionary was filled with values.
As you can see, I didn't use any tag name in the code (except for the XML source, of course).
I've also added missing import from collections.
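As a possible alternative to explicit recursion, the same flattening can be sketched with iter(), which walks every element in document order. This sketch uses the standard library's xml.etree.ElementTree instead of lxml so that it has no third-party dependency; `flatten_leaves` is a hypothetical name, and the shortened sample document is an assumption based on the one in the question:

```python
import xml.etree.ElementTree as ET

sample = """<MAIN>
<NOSUBEL>abcd</NOSUBEL>
<MULTISUB>
<WHATEVER>
<ANOTHERONE>deep value</ANOTHERONE>
</WHATEVER>
</MULTISUB>
<SUBEL2>
<FOO>abcdefg</FOO>
</SUBEL2>
<NOSUBEL3>abc</NOSUBEL3>
</MAIN>"""


def flatten_leaves(root):
    """Map tag name -> text for every element that has no children."""
    # len(element) is the number of direct children, so 0 means "leaf".
    return {el.tag: el.text for el in root.iter() if len(el) == 0}


leaves = flatten_leaves(ET.fromstring(sample))
print(leaves)
```

As with the recursive version, leaves that share a tag name overwrite each other, so this only fits documents where leaf tag names are unique.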

python etree with xpath and namespaces with prefix

I can't find information on how to parse my XML with a namespace prefix.
I have this XML:
<par:Request xmlns:par="http://somewhere.net/actual">
    <par:actual>blabla</par:actual>
    <par:documentType>string</par:documentType>
</par:Request>
And tried to parse it:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
for subtag in rootxml.xpath(u'//par:actual'):
    #do something
    print(subtag)
And I got an exception, because it doesn't know about the namespace prefix.
Is there a best way to solve this, given that the script will not know in advance which file it is going to parse or which tag it is going to search for?
Searching the web and Stack Overflow, I found that if I add this:
namespace = {u'par': u"http://somewhere.net/actual"}
for subtag in rootxml.xpath(u'//par:actual', namespaces=namespace):
    #do something
    print(subtag)
That works. Perfect. But I don't know which XML I will parse, and the tag to search for (such as //par:actual) is also unknown to my script. So I need to find a way to extract the namespace from the XML somehow.
I found a lot of ways to extract the namespace URI, such as:
print(rootxml.tag)
print(rootxml.xpath('namespace-uri(.)'))
print(rootxml.xpath('namespace-uri(/*)'))
But how should I extract the prefix to create the dictionary that ElementTree wants from me? I don't want to run a regular-expression monster over the XML body to extract the prefix. I believe there has to be a supported way to do that, doesn't there?
And maybe there is some method in etree that extracts the namespaces from the XML as a dictionary (as etree wants) without manual manipulation?
You cannot rely on the namespace declarations on the root element: there is no guarantee that the declarations will even be there, or that the document will have the same prefix for the same namespace throughout.
Assuming you are going to have some way of passing the tag you want to search for (because you say it is not known by your script), you should also provide a way to pass a namespace mapping. Or use James Clark notation, like {http://somewhere.net/actual}actual (ETXPath supports this syntax, whereas "normal" xpath does not, but you can also use other methods like .findall() if you don't need full XPath).
If you don't care for the prefix at all, you could also use the local-name() function in xpath, eg. //*[local-name()="actual"] (but you won't be "really" sure it's the right "actual")
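Both ideas (naming the namespace in full, or ignoring it) can be sketched with findall() in the standard library's xml.etree.ElementTree. This is an illustration, not the lxml xpath() API from the question, and the `{*}` wildcard form assumes Python 3.8 or later:

```python
import xml.etree.ElementTree as ET

xml_text = (
    '<par:Request xmlns:par="http://somewhere.net/actual">'
    "<par:actual>blabla</par:actual>"
    "<par:documentType>string</par:documentType>"
    "</par:Request>"
)
root = ET.fromstring(xml_text)

# Clark notation: the namespace URI is written out, so no prefix mapping
# is needed and the prefix used in the file does not matter.
exact = root.findall(".//{http://somewhere.net/actual}actual")

# Wildcard namespace (Python 3.8+): matches "actual" in any namespace or
# in none, with the same caveat as local-name() about false matches.
any_ns = root.findall(".//{*}actual")

print([el.text for el in exact], [el.text for el in any_ns])
```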
Oh, I found it.
After we do that:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
Object rootxml contains dictionary nsmap, which contains all namespaces that I want.
So, simplest solution I've found:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
nss = rootxml.nsmap
for subtag in rootxml.xpath(u'//par:actual', namespaces=nss):
    #do something
    print(subtag)
That works.
UPD: this works only if the user understands what 'par' means in the XML being processed, for example by comparing the expected namespace with the existing one before any other operations.
Still, I much prefer the variant with XPath that understands {...}actual; that was what I tried to achieve.
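The standard library's xml.etree.ElementTree has no nsmap attribute, but a comparable prefix-to-URI dictionary can be collected while parsing by listening for "start-ns" events in iterparse(). A minimal sketch, assuming the document from the question is held in memory; `collect_namespaces` is a hypothetical helper name:

```python
import io
import xml.etree.ElementTree as ET

xml_text = (
    '<par:Request xmlns:par="http://somewhere.net/actual">'
    "<par:actual>blabla</par:actual>"
    "<par:documentType>string</par:documentType>"
    "</par:Request>"
)


def collect_namespaces(source):
    """Return {prefix: uri} for every namespace declaration in the document."""
    namespaces = {}
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        # Keep the first URI seen for a prefix; a document may rebind it later.
        namespaces.setdefault(prefix, uri)
    return namespaces


namespaces = collect_namespaces(io.BytesIO(xml_text.encode("utf-8")))
root = ET.fromstring(xml_text)
matches = root.findall(".//par:actual", namespaces)
print(namespaces, [el.text for el in matches])
```

Unlike reading the declarations on the root element, this picks up declarations made anywhere in the document. The earlier caveat still applies: a prefix is only a local alias, so the caller has to know which URI a given prefix is supposed to stand for.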
With Python 3.8.2, I found this question while having the same issue.
This is the solution I found: put the namespace in the XPath query (between the {}).
ApplicationArea = BOD_IN_tree.find('.//ApplicationArea', ns)
if ApplicationArea is None:
    ApplicationArea = BOD_IN_tree.find('.//{http://www.defaultNamespace.com/2}ApplicationArea', ns)
I search for the element without the namespace, then search again with it if nothing is found. I have no control over the inbound documents: some have namespaces, some do not.
I hope this helps!

python lxml adds unused namespaces

I'm having an issue when using lxml's find() method to select a node in an xml file. Essentially I am trying to move a node from one xml file to another.
File 1:
<somexml xmlns:a='...' xmlns:b='...' xmlns:c='...'>
    <somenode id='foo'>
        <something>bar</something>
    </somenode>
</somexml>
Once I parse File 1 and do a find on it:
node = tree.find('//*[@id="foo"]')
Node looks like this:
<somenode xmlns:a='...' xmlns:b='...' xmlns:c='...'>
    <something>bar</something>
</somenode>
Notice it added the namespaces that were found in the document to that node. However, nothing in that node uses any of those namespaces. How would I go about either A) not writing namespaces that aren't used in the selected node, or B) removing unused name space declarations? If it's being used in the selected node then I will need it, but otherwise, I would like to get rid of them. Any ideas? Thanks!
If the namespaces are in the document, then the document uses the namespaces. The namespaces are being used in those nodes, because those nodes are part of the subtree which declared the namespace. Follow the link given by Daenyth to remove them, or strip them off the XML string before you turn it into an lxml object.
