Error 'failed to load external entity' when using Python lxml - python

I'm trying to parse an XML document I retrieve from the web, but it crashes after parsing with this error:
': failed to load external entity "<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>
That is the second line in the XML that is downloaded. Is there a way to prevent the parser from trying to load the external entity, or another way to solve this? This is the code I have so far:
import urllib2
import lxml.etree as etree
file = urllib2.urlopen("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
data = file.read()
file.close()
tree = etree.parse(data)

In concert with what mzjn said, if you do want to pass a string to etree.parse(), just wrap it in a StringIO object.
Example:
from lxml import etree
from StringIO import StringIO
myString = "<html><p>blah blah blah</p></html>"
tree = etree.parse(StringIO(myString))
This method is used in the lxml documentation.
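On Python 3 the StringIO module no longer exists, so the same idea might look like the sketch below. It assumes io.BytesIO as the wrapper and falls back to the standard library's ElementTree when lxml is not installed; the variable names are illustrative only.

```python
import io

try:
    from lxml import etree  # preferred when it is installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback offering the same two calls

xml_text = "<html><p>blah blah blah</p></html>"

# parse() wants a file name, URL, or file-like object, so wrap the encoded text
tree = etree.parse(io.BytesIO(xml_text.encode("utf-8")))
root = tree.getroot()

# fromstring() takes the markup itself and returns the root element directly
same_root = etree.fromstring(xml_text)

print(root.tag, same_root[0].text)
```

Either call avoids the original mistake of handing the markup to parse() as if it were a file name.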

etree.parse(source) expects source to be one of
a file name/path
a file object
a file-like object
a URL using the HTTP or FTP protocol
The problem is that you are supplying the XML content as a string.
You can also do without urllib2.urlopen(). Just use
tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
Demonstration (using lxml 2.3.4):
>>> from lxml import etree
>>> tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
>>> tree.getroot()
<Element {http://www.w3.org/2005/Atom}feed at 0xedaa08>
>>>
In a competing answer, it is suggested that lxml fails because of the stylesheet referenced by the processing instruction in the document. But that is not the problem here. lxml does not try to load the stylesheet, and the XML document is parsed just fine if you do as described above.
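A quick way to check that the processing instruction is harmless is to parse a document whose stylesheet cannot possibly be found. The sketch below uses a made-up stand-in for the downloaded file (only its first two lines match the real one) and falls back to the standard library's ElementTree if lxml is missing.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Made-up stand-in for the downloaded file: same first two lines, tiny body
document = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>\n'
    b'<feed xmlns="http://www.w3.org/2005/Atom"><title>demo</title></feed>'
)

# No stylesheet file is available here, yet parsing the content succeeds
root = etree.fromstring(document)
print(root.tag)
```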
If you want to actually load the stylesheet, you have to be explicit about it. Something like this is needed:
from lxml import etree
tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")
# Create an _XSLTProcessingInstruction object
pi = tree.xpath("//processing-instruction()")[0]
# Parse the stylesheet and return an ElementTree
xsl = pi.parseXSL()

The lxml documentation for parse() says: "To parse from a string, use the fromstring() function instead."
parse(...)
parse(source, parser=None, base_url=None)
Return an ElementTree object loaded with source elements. If no parser
is provided as second argument, the default parser is used.
The ``source`` can be any of the following:
- a file name/path
- a file object
- a file-like object
- a URL using the HTTP or FTP protocol
To parse from a string, use the ``fromstring()`` function instead.
Note that it is generally faster to parse from a file path or URL
than from an open file object or file-like object. Transparent
decompression from gzip compressed sources is supported (unless
explicitly disabled in libxml2).

You're getting that error because the XML you're loading references an external resource:
<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>
LXML doesn't know how to resolve GreenButtonDataStyleSheet.xslt. You and I probably realize that it's going to be available relative to your original URL, http://www.greenbuttondata.org/data/15MinLP_15Days.xml...the trick is to tell lxml how to go about loading it.
The lxml documentation includes a section titled "Document loading and URL resolving", which has just about all the information you need.

Related

View an XML file in Tree View using appJar

I am making a GUI using appJar (a Python library built on Tkinter). I have an XML file, which I parse with the ElementTree XML parsing library.
I want to see my XML file in a tree view.
So I parse the file with ElementTree, pick out the tags I need to show in the tree view, build a new XML object from them, and pass that object to the appJar function .addTree().
But I get this error:
..lib\site-packages\appJar\appjar.py", line 8764, in addTree
xmlDoc = parseString(data).
...lib\xml\dom\expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
TypeError: a bytes-like object is required, not 'ElementTree'
xml = et.Element(root)
print(xml)
for ele in valList:
    reg = et.SubElement(xml, ele.find('Name').text)
    bitroot = ele.findall('Bit')
    for bit in bitroot:
        et.SubElement(reg, bit.find('Name').text)
xmltree = et.ElementTree(xml)
app.startFrame('bottomleft', 1, 0, 2)
app.setBg('orange')
app.setSticky('news')
app.setStretch('none')
app.addTree('REGISTER', xmltree)
As far as I can tell, the error occurs because the .addTree() API cannot read the format of the xmltree variable.
According to appJar documentation, you need to pass an XML string to .addTree(), not an ElementTree. According to ElementTree documentation, you can use xml.etree.ElementTree.tostring() to build an XML string from your Element:
xml_string = et.tostring(xml)
app.addTree('REGISTER', xml_string)
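To see what tostring() hands back before involving the GUI, a minimal sketch along these lines may help. The register and bit names are hypothetical stand-ins for the question's valList, and appJar itself is deliberately left out.

```python
import xml.etree.ElementTree as et

# Hypothetical register/bit names standing in for the question's valList
registers = {"CTRL": ["ENABLE", "RESET"], "STATUS": ["READY"]}

root = et.Element("REGISTERS")
for reg_name, bit_names in registers.items():
    reg = et.SubElement(root, reg_name)
    for bit_name in bit_names:
        et.SubElement(reg, bit_name)

xml_bytes = et.tostring(root)                     # bytes by default
xml_text = et.tostring(root, encoding="unicode")  # str, without an XML declaration

print(type(xml_bytes).__name__, type(xml_text).__name__)
print(xml_text)
```

The traceback asks for a bytes-like object, so the default bytes result is presumably what .addTree() should receive.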

Remove ns0 from XML

I have an XML file where I would like to edit certain attributes. I am able to properly edit the attributes but when I write the changes to the file, the tags have a strange "ns0" added onto them. How can I get rid of this? This is what I have tried and have been unsuccessful. I am working in Python and using lxml.
import xml.etree.ElementTree as ET
from xml.etree import ElementTree as etree
from lxml import etree, objectify
frag_xml_tree = ET.parse(xml_name)
frag_root = frag_xml_tree.getroot()
for e in frag_root:
    for elem in frag_root.iter(e):
        elem.attrib[frag_param_name] = update_val
etree.register_namespace("", "http://www.w3.org/2001")
frag_xml_tree.write(xml_name)
However, when I do this, I only get the error Invalid tag name u''. I thought this error came up if the xml tags started with digits but that is not the case with my xml. I am really stuck on how to proceed. Thanks
Actually the way to do it seemed to be a combination of two things.
The import statement is import xml.etree.ElementTree as ET
ET.register_namespace("", NAMESPACE) is the correct call, where NAMESPACE is the namespace listed in the input xml, ie the url after xmlns.
Here's the corrected code using only xml.etree.ElementTree instead of lxml:
import xml.etree.ElementTree as ET
frag_xml_tree = ET.parse(xml_name)
frag_root = frag_xml_tree.getroot()
for e in frag_root:
    for elem in frag_root.iter(e):
        elem.attrib[frag_param_name] = update_val
ET.register_namespace("", "http://www.w3.org/2001")
frag_xml_tree.write(xml_name)
The following snippet removes ns0 throughout the XML file:
for i in range(0, len(list(root))):
    print(root[i])
ET.register_namespace("", NAMESPACE)
tree.write('TP_updated2.xml', xml_declaration=True, method='xml', encoding="utf8", default_namespace=None)
Here NAMESPACE is the URL after xmlns in the input file.
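The effect of register_namespace() can be seen without touching any files. The sketch below uses only xml.etree.ElementTree; the namespace URL and the tiny document are invented for illustration.

```python
import xml.etree.ElementTree as ET

NAMESPACE = "http://example.com/ns"  # hypothetical; use the URL after xmlns in your file
source = '<root xmlns="http://example.com/ns"><item size="1"/></root>'

root = ET.fromstring(source)
root[0].set("size", "2")  # edit an attribute, as in the question

# Without registration, ElementTree invents the ns0 prefix on output
before = ET.tostring(root, encoding="unicode")

# Mapping the namespace to the empty prefix makes it the default namespace again
ET.register_namespace("", NAMESPACE)
after = ET.tostring(root, encoding="unicode")

print(before)
print(after)
```

Note that the registration is global to the ElementTree module, so it affects every later serialization in the same process.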

Reading a page and parsing it with minidom.parse or minidom.parseString in Python?

I have tried both of these snippets:
import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parse(res)
which gives me the error xml.parsers.expat.ExpatError: syntax error: line 1, column 0
Or this:
import urllib
from xml.dom import minidom
res = urllib.urlopen('https://www.google.com/webhp#q=apple&start=10')
dom = minidom.parseString(res.read())
which gives me the same error. res.read() reads fine and is a string.
I would like to parse through the code later. How can I do this using xml.dom.minidom?
The reason you're getting this error is that the page isn't valid XML. It's HTML 5. The doctype right at the top tells you this, even if you ignore the content type. You can't parse HTML with an XML parser.*
If you want to stick with what's in the stdlib, you can use html.parser (Python 3.x) / HTMLParser (2.x).** However, you may want to consider third-party libraries like lxml (which, despite the name, can parse HTML), html5lib, or BeautifulSoup (which wraps up a lower-level parser in a really nice interface).
* Well, unless it's XHTML, or the XML output of HTML5, but that's not the case here.
** Do not use htmllib unless you're using an old version of Python without a working HTMLParser. This module is deprecated for a reason.
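As a rough idea of the stdlib route on Python 3, the sketch below subclasses html.parser.HTMLParser to collect link targets. The LinkCollector class and the sample page are invented for illustration; a real page would be fetched first and decoded to text.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Deliberately sloppy HTML (unclosed <p>, bare <br>) that an XML parser would reject
page = (
    "<!doctype html><html><body><p>Unclosed paragraph<br>"
    "<a href='/next?start=10'>Next</a></body></html>"
)

collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.links)
```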

python feedparser

How would you parse XML data like the following with Python feedparser?
<Book_API>
<Contributor_List>
<Display_Name>Jason</Display_Name>
</Contributor_List>
<Contributor_List>
<Display_Name>John Smith</Display_Name>
</Contributor_List>
</Book_API>
That doesn't look like any sort of RSS/ATOM feed. I wouldn't use feedparser at all for that, I would use lxml. In fact, feedparser can't make any sense of it and drops the "Jason" contributor in your example.
from lxml import etree
data = <fetch the data somehow>
root = etree.parse(data)
Now you have a tree of xml objects. How to do it in lxml more specifically is impossible to say until you actually give valid XML data. ;)
As Lennart Regebro mentioned, this does not look like an RSS/Atom feed; it is just an XML document. The Python standard library includes several XML parsing facilities (both SAX and DOM). I recommend ElementTree. Among third-party libraries, lxml is the best choice, and it is largely a drop-in replacement for ElementTree.
try:
    from lxml import etree
except ImportError:
    try:
        # C-accelerated ElementTree on older Pythons
        import xml.etree.cElementTree as etree
    except ImportError:
        import xml.etree.ElementTree as etree
doc = """<Book_API>
<Contributor_List>
<Display_Name>Jason</Display_Name>
</Contributor_List>
<Contributor_List>
<Display_Name>John Smith</Display_Name>
</Contributor_List>
</Book_API>"""
xml_doc = etree.fromstring(doc)
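Once the document is parsed, pulling out the contributor names could look like the following sketch. It repeats the sample data so that it runs on its own, and the names list is just an illustrative variable.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

doc = """<Book_API>
<Contributor_List>
<Display_Name>Jason</Display_Name>
</Contributor_List>
<Contributor_List>
<Display_Name>John Smith</Display_Name>
</Contributor_List>
</Book_API>"""

xml_doc = etree.fromstring(doc)

# iter() walks the whole tree and yields every element with the given tag
names = [element.text for element in xml_doc.iter("Display_Name")]
print(names)
```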
