Comments in XML at beginning of document - python

my PYTHON xml parser fails if there´s a comment at the beginnging of an xml file like::
<?xml version="1.0" encoding="utf-8"?>
<!-- Script version: "1"-->
<!-- Date: "07052010"-->
<component name="abc">
<pp>
....
</pp>
</component>
is it illegal to place a comment like this?
EDIT:
well it´s not throwing an error but the DOM module will fail and not recognize the child nodes:
import xml.dom.minidom as dom
sub_tree = dom.parse('xyz.xml')
for component in sub_tree.firstChild.childNodes:
print(component)
I cannot acces the child nodes; sub_tree.firstChild.childNodes returns an empty list,but if I remove those 2 comments I can loop through the list and read the childnodes as usual!
EDIT:
Guys, this simple example is working and enough to figure it out. start your python shell and execute this small code above. Once it will output nothing and after deleting the comments it will show up the node!

If you do this:
import xml.dom.minidom as dom
sub_tree = dom.parse('xyz.xml')
print sub_tree.children
You will see what is your problem:
>>> print sub_tree.childNodes
[<DOM Comment node " Script ve...">, <DOM Comment node " Date: "07...">, <DOM Element: component at 0x7fecf88c>]
firstChild will obviously pick up the first child, which is a comment and doesn't have any children of its own.
You could iterate over the children and skip all comment nodes.
Or you could ditch the DOM model and use ElementTree, which is so much nicer to work with. :)

It is legal; from XML 1.0 Reference:
2.5 Comments
[Definition: Comments may appear
anywhere in a document outside other
markup; in addition, they may appear
within the document type declaration
at places allowed by the grammar. They
are not part of the document's
character data; an XML processor MAY,
but need not, make it possible for an
application to retrieve the text of
comments. For compatibility, the
string " -- " (double-hyphen) MUST NOT
occur within comments.] Parameter
entity references MUST NOT be
recognized within comments.

To get better answers, show us (a) a small complete Python script and (b) a small complete XML document that together demonstrate the unexpected behaviour.
Have you considered using ElementTree?

That should be legal as long as the XML declaration is on the first line.

Related

Parsing Google Earth KML file in Python (lxml, namespaces)

I am trying to parse a .kml file into Python using the xml module (after failing to make this work in BeautifulSoup, which I use for HTML).
As this is my first time doing this, I followed the official tutorial and all goes well until I try to construct an iterator to extract my data by root iteration:
from lxml import etree
tree=etree.parse('kmlfile')
Here is the example from the tutorial I am trying to emulate:
If you know you are only interested in a single tag, you can pass its name to getiterator() to have it filter for you:
for element in root.getiterator("child"):
print element.tag, '-', element.text
I would like to get all data under 'Placemark', so I tried
for i in tree.getiterterator("Placemark"):
print i, type(i)
which doesn't give me anything. What does work is:
for i in tree.getiterterator("{http://www.opengis.net/kml/2.2}Placemark"):
print i, type(i)
I don't understand how this comes about. The www.opengis.net is listed in the tag at the beginning of the document (kml xmlns="http://www.opengis.net/kml/2.2"...) , but I don't understand
how the part in {} relates to my specific example at all
why it is different from the tutorial
and what I am doing wrong
Any help is much appreciated!
Here is my solution.
So, the most important thing to do is read this as posted by Tomalak. It's a really good description of namespaces and easy to understand.
We are going to use XPath to navigate the XML document. Its notation is similar to file systems, where parents and descendants are separated by slashes /. The syntax is explained here, but note that some commands are different for the lxml implementation.
###Problem
Our goal is to extract the city name: the content of <name> which is under <Placemark>. Here's the relevant XML:
<Placemark> <name>CITY NAME</name>
The XPath equivalent to the non-functional code I posted above is:
tree=etree.parse('kml document')
result=tree.xpath('//Placemark/name/text()')
Where the text() part is needed to get the text contained in the location //Placemark/name.
Now this doesn't work, as Tomalak pointed out, cause the name of these two nodes are actually {http://www.opengis.net/kml/2.2}Placemark and {http://www.opengis.net/kml/2.2}name. The part in curly brackets is the default namespace. It does not show up in the actual document (which confused me) but it is defined at the beginning of the XML document like this:
xmlns="http://www.opengis.net/kml/2.2"
###Solution
We can supply namespaces to xpath by setting the namespaces argument:
xpath(X, namespaces={prefix: namespace})
This is easy enough for the namespaces that have actual prefixes, in this document for instance <gx:altitudeMode>relativeToSeaFloor</gx:altitudeMode> where the gx prefix is defined in the document as xmlns:gx="http://www.google.com/kml/ext/2.2".
However, Xpath does not understand what a default namespace is (cf docs). Therefore, we need to trick it, like Tomalak suggested above: We invent a prefix for the default and add it to our search terms. We can just call it kml for instance. This piece of code actually does the trick:
tree.xpath('//kml:Placemark/kml:name/text()', namespaces={"kml":"http://www.opengis.net/kml/2.2"})
The tutorial mentions that there is also an ETXPath method, that works just like Xpath except that one writes the namespaces out in curly brackets instead of defining them in a dictionary. Thus, the input would be of the style {http://www.opengis.net/kml/2.2}Placemark.

Python xml - handle unclosed token

I am reading in hundreds of XML files and parsing them with xml.etree.ElementTree.
Quick background just fwiw:
These XML files were at one point totally valid but somehow when processing them historically my process which copied/pasted them may have corrupted them. (Turns out it was a flushing issue / with statement not closing, if you care, see the good help I got on that investigation at... Python shutil copyfile - missing last few lines ).
Anyway back to the point of this question.
I would still like to read in the first 100,000 lines or so of these documents which are valid XML. The files are only missing the last 4 or 5KB of a 6MB file. As alluded to earlier, though, the file just 'cuts out'. it looks like this:
</Maintag>
<Maintag>
<Change_type>NQ</Change_type>
<Name>Atlas</Name>
<Test>ATLS</Test>
<Other>NYSE</Other>
<Scheduled_E
where (perhaps obviously) Scheduled_E is the beginning of what should be another attribute, <.Scheduled_Event>, say. But the file gets cut short mid tag. Once again, before this point in the file, there are several thousand 'good' "Maintag" entries which I would like to read in, accepting the cutoff entry (and obviously anything that should have come after) as an unrecoverable fail.
A simple but incomplete method of dealing with this might be to simply - pre XML processing - look for the last instance of the string <./Maintag> in the file, and replace what follows (which will be broken, at some point) with the 'opening' tags. Again, this at least lets me process what is still there and valid.
If someone wants to help me out with that sort of string replacement, then fwiw the opening tags are:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<FirstTag>
<Source FileName="myfile">
I am hoping that even easier than that, there might be an elementtree or beautifulsoup or other way of handling this situation... I've done a decent amount of searching and nothing seems easy/obvious.
Thanks
For dealing with unclosed elements -or token as in the title of this questioin-, I'd recommend to try lxml. lxml's XMLParser has recover option which documented as :
recover - try hard to parse through broken XML
For example, given a broken XML as follow :
from lxml import etree
xml = """
<root>
<Maintag>
<Change_type>NQ</Change_type>
<Name>Atlas</Name>
<Test>ATLS</Test>
<Other>NYSE</Other>
<Scheduled_E
"""
parser = etree.XMLParser(recover=True)
doc = etree.fromstring(xml, parser=parser)
print(etree.tostring(doc))
The recovered XML as printed by the above code is as follow :
<root>
<Maintag>
<Change_type>NQ</Change_type>
<Name>Atlas</Name>
<Test>ATLS</Test>
<Other>NYSE</Other>
<Scheduled_E/></Maintag></root>

Parsing XML data within tags using lxml in python

My question is regarding how to get information stored in a tag which allows for no closing tag. Here's the relevant xml:
<?xml version="1.0" encoding="UTF-8"?>
<uws:job>
<uws:results>
<uws:result id="2014-03-03T15:42:31:1337" xlink:href="http://www.cosmosim.org/query/index/stream/table/2014-03-03T15%3A42%3A31%3A1337/format/csv" xlink:type="simple"/>
</uws:results>
</uws:job>
I'm looking to extract the xlink:href url here. As you can see the uws:result tag requires no closing tag. Additionally, having the 'uws:' makes it a bit tricky to handle them when working in python. Here's what I've tried so far:
from lxml import etree
root = etree.fromstring(xmlresponse.content)
url = root.find('{*}results').text
Where xmlresponse.content is the xml data to be parsed. What this returns is
'\n '
which indicates that it's only finding the newline character, since what I'm really after is contained within a tag inside the results tag. Any ideas would be greatly appreciated.
You found the right node; you extracted the data incorrectly. Instead of
url = root.find('{*}results').text
you really want
url = root.find('{*}results').get('attribname', 'value_to_return_if_not_present')
or
url = root.find('{*}results').attrib['attribname']
(which will throw an exception if not present).
Because of the namespace on the attribute itself, you will probably need to use the {ns}attrib syntax to look it up too.
You can dump out the attrib dictionary and just copy the attribute name out too.
text is actually the space between elements, and is not normally used but is supported both for spacing (like etreeindent) and some special cases.

Reading text from XML nodes using Python's libxml2

I am a first time XPath user and need to be able to get the text values of these different elements.. for instance time, title, etc.. I am using the libxml2 module in Python and so far have not had much luck getting just the values of the text I need. The code below here only returns the element tags.. i need the values.. any help would be GREATLY appreciated!
I'm using this code:
doc = libxml2.parseDoc(xmlOutput)
result = doc.xpathEval('//*')
With the following document:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SCAN_LIST_OUTPUT SYSTEM "https://qualysapi.qualys.com/api/2.0/fo/sca/scan_list_output.dtd">
<SCAN_LIST_OUTPUT>
<RESPONSE>
<DATETIME>2012-01-22T01:21:53Z</DATETIME>
<SCAN_LIST>
<SCAN>
<REF>scan/2343423</REF>
<TYPE>Scheduled</TYPE>
<TITLE><![CDATA[customer 1 5/20/2012]]></TITLE>
<USER_LOGIN>user1</USER_LOGIN>
<LAUNCH_DATETIME>2012-02-21T04:11:05Z</LAUNCH_DATETIME>
<STATUS>
<STATE>Finished</STATE>
</STATUS>
<TARGET><![CDATA[13.3.3.2, 13.8.8.10, 13.10.12.60, 13.10.12.11...]]></TARGET>
</SCAN>
</SCAN_LIST>
</RESPONSE>
</SCAN_LIST_OUTPUT>
You can call getContent() on each returned xmlNode object to retrieve the associated text. Note that this is recursive -- to non-recursively access text content in libxml2, you'll want to retrieve the associated text node under the element, and call .getContent() on that.
That said, this would be easier if you used lxml.etree (a higher-level Python API, still backing into the C libxml2 library) instead of the Python libxml2; in that case, it's simply element.text to access the associated content as a string.
Have a look at Mark Pilgrim's Dive Into Python 3, Chapter 12. XML
The chapter starts with short course to XML (general talk but with the Atom Syndication Feed example), then it continues with the standard xml.etree.ElementTree and continues with third party lxml that implements more with the same interface (full XPATH 1.0, based on libxml2).

Quickest/Best way to traverse XML with lxml in Python

I have an XML file that looks like this:
xml = '''<?xml version="1.0"?>
<root>
<item>text</item>
<item2>more text</item2>
<targetroot>
<targetcontainer>
<target>text i want to get</target>
</targetcontainer>
<targetcontainer>
<target>text i want to get</target>
</targetcontainer>
</targetroot>
...more items
</root>
'''
With lxml I'm trying to acces the text in the element < target >. I've found a solution, but I'm sure there is a better, more efficient way to do this. My solution:
target = etree.XML(xml)
for x in target.getiterator('root'):
item1 = x.findtext('item')
for target in x.iterchildren('targetroot'):
for t in target.iterchildren('targetcontainer'):
targetText = t.findtext('target')
Although this works, as it gives me acces to all the elements in root as well as the target element, I'm having a hard time believing this is the most efficient solution.
So my question is this: is there a more efficient way to access the < target >'s texts while staying in the loop of root, because I also need access to the other elements.
You can use XPath:
for x in target.xpath('/root/targetroot/targetcontainer/target'):
print x.text
We ask all elements that match a path. In this case, the path is /root/targetroot/targetcontainer/target, which means
all the <target> elements that are inside a <targetcontainer> element, inside a <targetroot> element, inside a <root> element. Also, the <root> element should be the document root because it is preceded by /, which means the beginning of the document.
Also, your XML document had two problems. First, the <?xml version="1.0"?> declaration should be the very first thing in the document - and in this example it is preceded by a newline and some space. Also, it is not a tag and should not be closed, so the </xml> at the end of your string should be removed. I already edited your question anyway.
EDIT: this solution can be improved yet. You do not need to pass all the path - you can just ask to all elements <target> inside the document. This is done by preceding the tag name by two slashes. Since you want all the <target> texts, independent of where they are, this can be a better solution. So, the loop above can be written just as:
for x in target.xpath('//target'):
print x.text
I tried it at first but it did not worked. The problem, however, was the syntax problems in the XML, not the XPath, but I tried the other, longer path and forgot to retry this one. Sorry! Anyway, I hope I put some light about XPath nonetheless :)

Categories