Python: Parsing XML autoadd all key/value pairs

I have searched for a long time and tried a lot, but I can't get my head around this seemingly easy scenario. I should say that I'm a Python newbie but a very good bash coder ;o) I have written some Python code, but there is probably a lot I still need to learn, so please don't be too harsh with me ;o) I'm willing to learn; I have read the Python docs and many examples and tried a lot on my own, but now I'm at a point where I'm poking around in the dark.
I parse content provided as XML. It is about 20-50 MB in size.
My XML Example:
<MAIN>
<NOSUBEL>abcd</NOSUBEL>
<NOSUBEL2>adasdasa</NOSUBEL2>
<MULTISUB>
<WHATEVER>
<ANOTHERSUBEL>
<ANOTHERONE>
(how many levels can not be said / can change)
</ANOTHERONE>
</ANOTHERSUBEL>
</WHATEVER>
</MULTISUB>..
<SUBEL2>
<FOO>abcdefg</FOO>
</SUBEL2>
<NOSUBEL3>abc</NOSUBEL3>
...
and so on
</MAIN>
This is the main part of parsing it (if you need more details pls ask):
from lxml import etree
resp = my.request(some call args)
xml = etree.XML(resp)
for element in xml.findall(".//MAIN"):
    # this works fine but is not generic enough:
    my_dict = OrderedDict()
    for only1sub in element.iter(tag="SUBEL2"):
        for i in only1sub:
            my_dict[i.tag] = i.text
This works fine with one subelement, but it means I need to know which elements in the tree have subelements and which do not. That could change in the future, or new elements could be added.
Another problem is MULTISUB. With the above code I can only parse down to the first tag.
The goal
What I WANT to achieve is - at best:
A) Having one function / code snippet which is able to parse the whole XML content and if there is a subelement (e.g. with "if len(x)" or whatever) then parse to the next level until you reach a level without a subelement/tree. Then go on to B)
B) For each XML tag found which has NO subelements I want to update the dictionary with the tag name and the tag text.
C) I want to do that for all available elements - the tag and the direct child tag names (e.g. "NOSUBEL2" or "MULTISUB") will not change (often) so it will be ok to use them as a start point for parsing.
What I have tried so far is chaining several loops (for, while, for again and so on), but nothing was fully successful. I also dug into Python generators because I thought I could do something with the next() function, but that got me nowhere either. Then again, I may not have the knowledge to use them correctly, so I'm grateful for every answer.
In the end, what I need is quite simple, I believe. I only want key/value pairs made from the tag name and the tag content; that can't be so hard, can it? Any help is greatly appreciated.
Can you help me reach this goal?
(Thanks already for reading this far!)

What you are looking for is recursion: a technique in which a procedure calls itself on a sub-problem of the original problem. In this case: for each element, either run the same procedure on each of its subelements (if it has any), or update your dictionary with the element's tag name and text.
I assume that in the end you want a dictionary (OrderedDict) containing a "flat representation" of the leaves of the whole element tree (nodes without subelements), mapping tag names to text values. In your case, printed out, it would look like this:
OrderedDict([('NOSUBEL', 'abcd'), ('NOSUBEL2', 'adasdasa'), ('ANOTHERONE', '(how many levels can not be said / can change)'), ('FOO', 'abcdefg'), ('NOSUBEL3', 'abc')])
Generally, you define a function that either calls itself with part of your data (here: the subelements, if there are any) or does something (here: updates a dictionary instance).
Since I don't know the details behind the my.request call, I've replaced it with parsing a string containing valid XML, based on the one you provided. Just replace the construction of the tree object.
resp = """<MAIN>
<NOSUBEL>abcd</NOSUBEL>
<NOSUBEL2>adasdasa</NOSUBEL2>
<MULTISUB>
<WHATEVER>
<ANOTHERSUBEL>
<ANOTHERONE>(how many levels can not be said / can change)</ANOTHERONE>
</ANOTHERSUBEL>
</WHATEVER>
</MULTISUB>
<SUBEL2>
<FOO>abcdefg</FOO>
</SUBEL2>
<NOSUBEL3>abc</NOSUBEL3>
</MAIN>"""
from collections import OrderedDict
from lxml import etree

def update_dict(element, my_dict):
    # lxml defines the "length" of an element as the number of its children.
    if len(element):  # If "length" is other than 0.
        for subelement in element:
            # That's where the recursion happens. We're calling the same
            # function for a subelement of the element.
            update_dict(subelement, my_dict)
    else:  # Otherwise, the element is a leaf.
        my_dict[element.tag] = element.text

if __name__ == "__main__":
    # Change/amend it with your my.request call.
    tree = etree.XML(resp)  # That's a <MAIN> element, too.
    my_dict = OrderedDict()
    # That's the first invocation of the procedure. We're passing the entire
    # tree and an instance of the dictionary.
    update_dict(tree, my_dict)
    print(my_dict)  # Just to see that the dictionary was filled with values.
As you can see, I didn't use any tag name in the code (except for the XML source, of course).
I've also added missing import from collections.
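One caveat with the flat dictionary: if the same leaf tag appears more than once (for example FOO under two different parents), later values overwrite earlier ones. Below is a minimal sketch of a variant that keeps every value in a list and walks the tree with iter() instead of explicit recursion. It uses the standard library's xml.etree.ElementTree only so that it runs without extra packages; iter() and len() behave the same way on lxml elements. The names leaves_to_dict and SAMPLE, and the extra SUBEL3 element, are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import OrderedDict

# Hypothetical sample, shortened from the question; SUBEL3 is added to show a repeated tag.
SAMPLE = """<MAIN>
  <NOSUBEL>abcd</NOSUBEL>
  <MULTISUB><WHATEVER><ANOTHERONE>deep</ANOTHERONE></WHATEVER></MULTISUB>
  <SUBEL2><FOO>abcdefg</FOO></SUBEL2>
  <SUBEL3><FOO>second</FOO></SUBEL3>
</MAIN>"""


def leaves_to_dict(root):
    """Map each leaf tag to a list of its text values, in document order."""
    result = OrderedDict()
    for element in root.iter():        # visits root and all descendants
        if len(element) == 0:          # no children, so this is a leaf
            result.setdefault(element.tag, []).append(element.text)
    return result


flat = leaves_to_dict(ET.fromstring(SAMPLE))
print(flat)
```

Whether lists are the right shape depends on what you do with the data afterwards; if tag names are guaranteed unique, the plain assignment from the answer is simpler.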

Related

Parsing XML with ElementTree's iter() with no argument, does not return the first several tags in file

I am trying to extract all of the headers from an XML file and put them into a list in python, however, every time I run my code the first tag extracted from the file is not actually first tag in the XML file. It instead begins with the 18th tag and then prints the remainder of the list from there. The really weird part is when I originally wrote this code, it worked as expected, but as I added code to extract the element text and put it in a list, the header code stopped working, both in the original program and the standalone code below. I should also mention the complete program does not manipulate the XML file in any way. All manipulation is done exclusively on the python lists after the extraction.
import xml.etree.ElementTree as ET
tree = ET.parse("Sample.xml")
root = tree.getroot()
headers = [elem.tag for elem in root.iter()]
print(headers)
Sample.XML is a sensitive file so I had to redact all the element text. It is also a very large file so I only included one account's worth of elements.
-<ExternalCollection xmlns="namespace.xsd">
-<Batch>
<BatchID>***</BatchID>
<ExternalCollectorName>***</ExternalCollectorName>
<PrintDate>***</PrintDate>
<ProviderOrganization>***</ProviderOrganization>
<ProvOrgID>***</ProvOrgID>
-<Account>
<AccountNum>***</AccountNum>
<Guarantor>***</Guarantor>
<GuarantorAddress1>***</GuarantorAddress1>
<GuarantorAddress2/>
<GuarantorCityStateZip>***</GuarantorCityStateZip>
<GuarantorEmail/>
<GuarantorPhone>***</GuarantorPhone>
<GuarantorMobile/>
<GuarantorDOB>***</GuarantorDOB>
<AccountID>***</AccountID>
<GuarantorID>***</GuarantorID>
-<Incident>
<Patient>***</Patient>
<PatientDOB>***</PatientDOB>
<FacilityName>***</FacilityName>
-<ServiceLine>
<DOS>***</DOS>
<Provider>***</Provider>
<Code>***</Code>
<Modifier>***</Modifier>
<Description>***</Description>
<Billed>***</Billed>
<Expected>***</Expected>
<Balance>***</Balance>
<SelfPay>***</SelfPay>
<IncidentID>***</IncidentID>
<ServiceLineID>***</ServiceLineID>
-<OtherActivity>
</OtherActivity>
</ServiceLine>
</Incident>
</Account>
</Batch>
</ExternalCollection>
The output is as follows:
'namespace.xsd}PatientDOB', '{namespace.xsd}FacilityName', '{namespace.xsd}ServiceLine', '{namespace.xsd}DOS', '{namespace.xsd}Provider', '{namespace.xsd}Code', '{namespace.xsd}Modifier', '{namespace.xsd}Description', '{namespace.xsd}Billed', '{namespace.xsd}Expected', '{namespace.xsd}Balance', '{namespace.xsd}SelfPay', '{namespace.xsd}IncidentID', '{namespace.xsd}ServiceLineID', '{namespace.xsd}OtherActivity'
As you can see, for some reason the first returned value is PatientDOB instead of the actual first tag.
Thank y'all in advance!
Your input file should not contain "-" characters in front of XML tags. You should drop at least the first "-", in front of the root tag; otherwise a parsing error occurs.
Note also that your first printed tag name has no initial "{", so apparently something weird is going on with your list, presumably after your loop.
I ran your code and got a proper list containing all tags.
Try the following loop:
for elem in root.iter():
    print(elem.tag)
Maybe it will give you some clue about the real cause of your error.
Consider also upgrading your Python installation. Maybe you have some outdated modules.
Yet another hint: run your code on just the input that you included in your post, with the content replaced by "***". Maybe the real cause of your error is in the actual content of some source element (which you replaced here with asterisks).
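For a quick sanity check, here is a small sketch that runs the same iter() call on a cut-down version of the posted XML (without the "-" markers) and strips the "{namespace.xsd}" part from each tag. The helper local_name and the shortened SAMPLE are assumptions made for illustration, not part of the original program.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the posted file, with the "-" markers removed.
SAMPLE = """<ExternalCollection xmlns="namespace.xsd">
  <Batch>
    <BatchID>***</BatchID>
    <Account><AccountNum>***</AccountNum></Account>
  </Batch>
</ExternalCollection>"""


def local_name(tag):
    # "{namespace.xsd}Batch" -> "Batch"; tags without a namespace pass through unchanged.
    return tag.split("}", 1)[-1]


root = ET.fromstring(SAMPLE)
headers = [local_name(elem.tag) for elem in root.iter()]
print(headers)
```

If the list starts with the root tag here but not in the full program, the problem is more likely in how the list is printed or modified later than in iter() itself.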

access elements and attribs DIRECTLY using lxml etree

Given the following xml structure:
<root>
<a>
<from name="abc">
<b>xxx</b>
<c>yyy</c>
</from>
<to name="def">
<b>blah blah</b>
<c>another blah blah</c>
</to>
</a>
</root>
How can I access directly the value of "from.b" of each "a" without loading first "from" (with find()) of each "a"?
As you can see there are exactly the same elements under "from" and "to". So the method findall() would not work as I have to differentiate where the value of "b" is coming from.
I would like a way to access it directly, because if I have to load each child element (there are a lot) my code would be quite verbose. In addition, performance counts in my case, and I have a lot of XML docs to parse! So I have to find the fastest method to go through the document (and store the data into a DB).
Within each "a" element there is exactly 1 "from" element and within each "from" element there is exactly 1 "b" element.
I have no problem doing this with lxml objectify, but I want to use etree: I first have to parse the XML document with etree in order to validate it against an XSD schema, and I do not want to parse the whole document again.
find (and findall) let you specify a path to elements as well. For example, you can do:
from lxml import etree as ET  # the question uses lxml's etree
root = ET.fromstring(input_xml)
for a in root.findall('a'):
    print(a, a.find('from/b').text)
assuming you always have exactly one from and one b element.
Otherwise, I might be tempted to use findall and do the checks in Python code if this needs to be more robust.
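To make the path idea concrete, here is a minimal sketch that reads the "b" text under both "from" and "to", plus the name attribute of "from", for every "a". It uses the standard library's xml.etree.ElementTree so it runs as is; find(), findtext() and get() are also available on lxml elements. The rows list is just an illustrative way to collect the values before writing them to a DB.

```python
import xml.etree.ElementTree as ET

input_xml = """<root>
  <a>
    <from name="abc"><b>xxx</b><c>yyy</c></from>
    <to name="def"><b>blah blah</b><c>another blah blah</c></to>
  </a>
</root>"""

root = ET.fromstring(input_xml)
rows = []
for a in root.findall("a"):
    from_b = a.findtext("from/b")            # text of <b> under <from>, or None if missing
    to_b = a.findtext("to/b")                # text of <b> under <to>, or None if missing
    from_name = a.find("from").get("name")   # attribute of <from>; assumes <from> exists
    rows.append((from_name, from_b, to_b))
print(rows)
```

findtext() returns None instead of raising when the path does not match, which is convenient if some "a" elements turn out to be incomplete.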

Parsing Google Earth KML file in Python (lxml, namespaces)

I am trying to parse a .kml file into Python using lxml (after failing to make this work in BeautifulSoup, which I use for HTML).
As this is my first time doing this, I followed the official tutorial and all goes well until I try to construct an iterator to extract my data by root iteration:
from lxml import etree
tree=etree.parse('kmlfile')
Here is the example from the tutorial I am trying to emulate:
If you know you are only interested in a single tag, you can pass its name to getiterator() to have it filter for you:
for element in root.getiterator("child"):
    print(element.tag, '-', element.text)
I would like to get all data under 'Placemark', so I tried
for i in tree.getiterator("Placemark"):
    print(i, type(i))
which doesn't give me anything. What does work is:
for i in tree.getiterator("{http://www.opengis.net/kml/2.2}Placemark"):
    print(i, type(i))
I don't understand how this comes about. The www.opengis.net is listed in the tag at the beginning of the document (kml xmlns="http://www.opengis.net/kml/2.2"...) , but I don't understand
how the part in {} relates to my specific example at all
why it is different from the tutorial
and what I am doing wrong
Any help is much appreciated!
Here is my solution.
So, the most important thing to do is read this as posted by Tomalak. It's a really good description of namespaces and easy to understand.
We are going to use XPath to navigate the XML document. Its notation is similar to file systems, where parents and descendants are separated by slashes /. The syntax is explained here, but note that some commands are different for the lxml implementation.
Problem
Our goal is to extract the city name: the content of <name> which is under <Placemark>. Here's the relevant XML:
<Placemark> <name>CITY NAME</name>
The XPath equivalent to the non-functional code I posted above is:
tree=etree.parse('kml document')
result=tree.xpath('//Placemark/name/text()')
Where the text() part is needed to get the text contained in the location //Placemark/name.
Now this doesn't work, as Tomalak pointed out, because the names of these two nodes are actually {http://www.opengis.net/kml/2.2}Placemark and {http://www.opengis.net/kml/2.2}name. The part in curly brackets is the default namespace. It does not show up on the individual tags in the document (which confused me), but it is defined at the beginning of the XML document like this:
xmlns="http://www.opengis.net/kml/2.2"
Solution
We can supply namespaces to xpath by setting the namespaces argument:
xpath(X, namespaces={prefix: namespace})
This is easy enough for the namespaces that have actual prefixes, in this document for instance <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode> where the gx prefix is defined in the document as xmlns:gx="http://www.google.com/kml/ext/2.2".
However, XPath does not understand what a default namespace is (cf. the docs). Therefore, we need to trick it, like Tomalak suggested above: we invent a prefix for the default namespace and add it to our search terms. We can just call it kml, for instance. This piece of code actually does the trick:
tree.xpath('//kml:Placemark/kml:name/text()', namespaces={"kml":"http://www.opengis.net/kml/2.2"})
The tutorial mentions that there is also an ETXPath class, which works just like XPath except that one writes the namespaces out in curly brackets instead of defining them in a dictionary. Thus, the input would be of the style {http://www.opengis.net/kml/2.2}Placemark.
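Both styles (an invented prefix plus a mapping, and the namespace written out in curly brackets) can be tried side by side. The sketch below uses findall() from the standard library's xml.etree.ElementTree rather than lxml's xpath() or ETXPath, only so that it runs without lxml installed; the inline KML string with two Placemark elements is a made-up stand-in for the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature KML document.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><name>CITY NAME</name></Placemark>
    <Placemark><name>OTHER CITY</name></Placemark>
  </Document>
</kml>"""

KML_NS = "http://www.opengis.net/kml/2.2"
root = ET.fromstring(KML)

# Variant 1: invent a prefix ("kml") and map it to the default namespace.
by_prefix = [e.text for e in root.findall(".//kml:Placemark/kml:name", {"kml": KML_NS})]

# Variant 2: write the namespace URI out in curly brackets in front of each tag.
placemark = "{%s}Placemark" % KML_NS
name = "{%s}name" % KML_NS
by_braces = [e.text for e in root.findall(".//%s/%s" % (placemark, name))]

print(by_prefix)
print(by_braces)
```

The prefix in variant 1 is purely local to the query; it does not have to match anything in the document, only the URI does.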

python etree with xpath and namespaces with prefix

I can't find information on how to parse my XML with a namespace.
I have this xml:
<par:Request xmlns:par="http://somewhere.net/actual">
<par:actual>blabla</par:actual>
<par:documentType>string</par:documentType>
</par:Request>
And tried to parse it:
from lxml import etree as ET  # .xpath() is provided by lxml, not by xml.etree
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
for subtag in rootxml.xpath(u'//par:actual'):
    # do something
    print(subtag)
I got an exception, because it doesn't know about the namespace prefix.
Is there a good way to solve this problem, given that the script will not know in advance which file it is going to parse or which tag it is going to search for?
Searching the web and Stack Overflow, I found that if I add this:
namespace = {u'par': u"http://somewhere.net/actual"}
for subtag in rootxml.xpath(u'//par:actual', namespaces=namespace):
    # do something
    print(subtag)
That works. Perfect. But I don't know which XML I will parse, and the tag to search for (such as //par:actual) is also unknown to my script. So I need to find a way to extract the namespace from the XML somehow.
I found a lot of ways to extract the namespace URI, such as:
print(rootxml.tag)
print(rootxml.xpath('namespace-uri(.)'))
print(rootxml.xpath('namespace-uri(/*)'))
But how should I extract the prefix to create the dictionary that ElementTree wants from me? I don't want to run a monster regular expression over the XML body to extract the prefix. I believe there has to be a supported way to do that, doesn't there?
And maybe there is some method in ETree to extract the namespaces from the XML as a dictionary (as ETree wants) without manual manipulation?
You cannot rely on the namespace declarations on the root element: there is no guarantee that the declarations will even be there, or that the document will have the same prefix for the same namespace throughout.
Assuming you are going to have some way of passing the tag you want to search (because you say it is not known by your script), you should also provide a way to pass a namespace mapping as well. Or use the James Clark notation, like {http://somewhere.net/actual}actual (the ETXPath has support for this syntax, whereas "normal" xpath does not, but you can also use other methods like .findall() if you don't need full xpath)
If you don't care for the prefix at all, you could also use the local-name() function in xpath, eg. //*[local-name()="actual"] (but you won't be "really" sure it's the right "actual")
Oh, I found it.
After we do this:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
the rootxml object contains a dictionary, nsmap, which contains all the namespaces that I want.
So, the simplest solution I've found:
dom = ET.parse(u'C:\\filepath\\1.xml')
rootxml = dom.getroot()
nss = rootxml.nsmap
for subtag in rootxml.xpath(u'//par:actual', namespaces=nss):
    # do something
    print(subtag)
That works.
UPD: that works if the user understands what 'par' means in the XML they are working with, for example by comparing the expected namespace with the existing namespace before any other operations.
Still, I much prefer the XPath variant that understands {...}actual; that was what I was trying to achieve.
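Two limits of nsmap are worth keeping in mind: as far as I know, the root's nsmap only reflects declarations in scope at the root element, and a default namespace shows up there under the key None, which lxml's xpath() does not accept as a prefix. As an alternative, here is a rough sketch that collects every prefix declared anywhere in the document using the "start-ns" event of iterparse(). It is written against the standard library's xml.etree.ElementTree; the function name collect_namespaces and the inline XML bytes are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory copy of the document from the question.
XML = b"""<par:Request xmlns:par="http://somewhere.net/actual">
  <par:actual>blabla</par:actual>
  <par:documentType>string</par:documentType>
</par:Request>"""


def collect_namespaces(source):
    """Return {prefix: uri} for namespace declarations found anywhere in the document."""
    namespaces = {}
    # Each "start-ns" event carries a (prefix, uri) tuple; a default
    # namespace is reported with an empty string as its prefix.
    for _event, (prefix, uri) in ET.iterparse(source, events=["start-ns"]):
        namespaces.setdefault(prefix, uri)  # keep the first declaration of each prefix
    return namespaces


nss = collect_namespaces(io.BytesIO(XML))
root = ET.fromstring(XML)
found = [e.text for e in root.findall(".//par:actual", nss)]
print(nss, found)
```

This still relies on knowing which prefix to put in the query, so the point from the answer above stands: passing a namespace URI along with the tag is more robust than trusting the document's prefixes.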
With Python 3.8.2, I found this question while having the same issue.
This is the solution I found: put the namespace in the XPath query (between the {}).
ApplicationArea = BOD_IN_tree.find('.//ApplicationArea', ns)
if ApplicationArea is None:
    ApplicationArea = BOD_IN_tree.find('.//{http://www.defaultNamespace.com/2}ApplicationArea', ns)
I search for the element without the namespace, then search again if it's not found. I have no control over the inbound documents; some have namespaces, some do not.
I hope this helps!
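Since Python 3.8, xml.etree.ElementTree also accepts a "{*}" wildcard in front of a tag name, which matches the tag in any namespace or in none. A small sketch of how the two-step lookup could collapse into one call, assuming the standard library parser is used; the two tiny BOD documents and the helper find_application_area are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical inbound documents: one with a default namespace, one without.
WITH_NS = '<BOD xmlns="http://www.defaultNamespace.com/2"><ApplicationArea>a</ApplicationArea></BOD>'
WITHOUT_NS = "<BOD><ApplicationArea>b</ApplicationArea></BOD>"


def find_application_area(root):
    # "{*}" matches ApplicationArea in any namespace, or in no namespace (Python 3.8+).
    return root.find(".//{*}ApplicationArea")


texts = [find_application_area(ET.fromstring(doc)).text for doc in (WITH_NS, WITHOUT_NS)]
print(texts)
```

The trade-off is the same as with local-name(): the wildcard will also match an ApplicationArea element from an unrelated namespace.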

Accessing Nested Tag Item with getElementsByTagName

If the same tag name is used in multiple places within an XML file, with the nesting providing uniqueness, what is the best way to specify the particular node of interest?
from xml.dom.minidom import parse
dom = parse("inputs.xml")
data_node = dom.getElementsByTagName("outer_level_x")[0].getElementsByTagName('inner_level_y')[0].getElementsByTagName('Data')
So, is there a better way to specify the "Data" node nested under "<outer_level_x><inner_level_y>"? The specific nesting is always known and a function which recurses calling getElementsByTagName could be written; but, I suspect that I am missing something basic here.
xml.etree.ElementTree supports a limited subset of XPath syntax when calling find/findall, which allows precision when specifying the desired tags/attributes.
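As an illustration, the chained getElementsByTagName calls could be replaced by a single path, roughly like this. The document layout below is invented (the question does not show inputs.xml), and the path assumes outer_level_x is a direct child of the root element; if it sits deeper, start the path with ".//" instead of "./".

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for inputs.xml; "Data" appears under several parents.
XML = """<config>
  <outer_level_x>
    <inner_level_y><Data>wanted</Data></inner_level_y>
    <inner_level_z><Data>other</Data></inner_level_z>
  </outer_level_x>
  <outer_level_w><inner_level_y><Data>elsewhere</Data></inner_level_y></outer_level_w>
</config>"""

root = ET.fromstring(XML)
# Only Data elements nested exactly under outer_level_x/inner_level_y are selected.
data_nodes = root.findall("./outer_level_x/inner_level_y/Data")
values = [node.text for node in data_nodes]
print(values)
```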
