Parse XML with (X)HTML entities

Parse XML with (X)HTML entities - python

Trying to parse XML, with ElementTree, that contains undefined entity (i.e. ) raises:
ParseError: undefined entity
In Python 2.x XML entity dict can be updated by creating parser (documentation):
parser = ET.XMLParser()
parser.entity["nbsp"] = unichr(160)
but how to do the same with Python 3.x?
Update: There was misunderstanding from my side, because I overlooked that I was calling parser.parser.UseForeignDTD(1) before trying to update XML entity dict, which was causing error with the parser. Luckily, #m.brindley was patient and pointed that XML entity dict still exists in Python 3.x and can be updated the same way as in Python 2.x

The issue here is that the only valid mnemonic entities in XML are quot, amp, apos, lt and gt. This means that almost all (X)HTML named entities must be defined in the DTD using the entity declaration markup defined in the XML 1.1 spec. If the document is to be standalone, this should be done with an inline DTD like so:
<?xml version="1.1" ?>
<!DOCTYPE naughtyxml [
<!ENTITY nbsp " ">
<!ENTITY copy "©">
]>
<data>
<country name="Liechtenstein">
<rank>1 ></rank>
<year>2008©</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
</data>
The XMLParser in xml.etree.ElementTree uses an xml.parsers.expat to do the actual parsing. In the init arguments for XMLParser, there is a space for 'predefined HTML entities' but that argument is not implemented yet. An empty dict named entity is created in the init method and this is what is used to look up undefined entities.
I don't think expat (by extension, the ET XMLParser) is able to handle switching namespaces to something like XHMTL to get around this. Possibly because it will not fetch external namespace definitions (I tried making xmlns="http://www.w3.org/1999/xhtml" the default namespace for the data element but it did not play nicely) but I can't confirm that. By default, expat will raise an error against non XML entities but you can get around that by defining an external DOCTYPE - this causes the expat parser to pass undefined entity entries back to the ET.XMLParser's _default() method.
The _default() method does a look up of the entity dict in the XMLParser instance and if it finds a matching key, it will replace the entity with the associated value. This maintains the Python-2.x syntax mentioned in the question.
Solutions:
If the data does not have an external DOCTYPE and has (X)HTML mnemonic entities, you are out of luck. It is not valid XML and expat is right to throw an error. You should add an external DOCTYPE.
If the data has an external DOCTYPE, you can just use your old syntax to map mnemonic names to characters. Note: you should use chr() in py3k - unichr() is not a valid name anymore
Alternatively, you could update XMLParser.entity with html.entities.html5 to map all valid HTML5 mnemonic entities to their characters.
If the data is XHTML, you could subclass HTMLParser to handle mnemonic entities but this won't return an ElementTree as desired.
Here is the snippet I used - it parses XML with an external DOCTYPE through HTMLParser (to demonstrate how to add entity handling by subclassing), ET.XMLParser with entity mappings and expat (which will just silently ignore undefined entities due to the external DOCTYPE). There is a valid XML entity (>) and an undefined entity (©) which I map to chr(0x24B4) with the ET.XMLParser.
from html.parser import HTMLParser
from html.entities import name2codepoint
import xml.etree.ElementTree as ET
import xml.parsers.expat as expat
xml = '''<?xml version="1.0"?>
<!DOCTYPE data PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<data>
<country name="Liechtenstein">
<rank>1></rank>
<year>2008©</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
</data>'''
# HTMLParser subclass which handles entities
print('=== HTMLParser')
class MyHTMLParser(HTMLParser):
def handle_starttag(self, name, attrs):
print('Start element:', name, attrs)
def handle_endtag(self, name):
print('End element:', name)
def handle_data(self, data):
print('Character data:', repr(data))
def handle_entityref(self, name):
self.handle_data(chr(name2codepoint[name]))
htmlparser = MyHTMLParser()
htmlparser.feed(xml)
# ET.XMLParser parse
print('=== XMLParser')
parser = ET.XMLParser()
parser.entity['copy'] = chr(0x24B8)
root = ET.fromstring(xml, parser)
print(ET.tostring(root))
for elem in root:
print(elem.tag, ' - ', elem.attrib)
for subelem in elem:
print(subelem.tag, ' - ', subelem.attrib, ' - ', subelem.text)
# Expat parse
def start_element(name, attrs):
print('Start element:', name, attrs)
def end_element(name):
print('End element:', name)
def char_data(data):
print('Character data:', repr(data))
print('=== Expat')
expatparser = expat.ParserCreate()
expatparser.StartElementHandler = start_element
expatparser.EndElementHandler = end_element
expatparser.CharacterDataHandler = char_data
expatparser.Parse(xml)

I was having a similar issue and got around it by using lxml. Its etree.XMLParser has a recover keyword argument which forces it to try to ignore broken XML.

from xml.etree import ElementTree
from html.entities import name2codepoint
from io import StringIO
import unicodedata
url = "https://docs.python.org/3/library/html.parser.html"
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 \
(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
html_response = requests.get(url=url, headers = headers)
def getParser():
return xp
with open("sample.html", "w", encoding="utf-8") as html_file:
html_file.write(html_response.text)
xp = ElementTree.XMLParser()
for k, v in name2codepoint.items(): xp.entity[k] = chr(v)
with open("sample.html", "r", encoding="utf-8") as html_file:
html = html_file.read()
b = StringIO(html)
t = ElementTree.parse(b, xp)
print(t)

Related

Unexpected results when parsing XML via lxml

The output of my xml parsing is not es expected.
The xml file
<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
I would expect to get something like root.tag == 'stationaer' and child.tag = 'einrichtung'.
See the outpout at the end.
This is the MWE
#!/usr/bin/env python3
import pathlib
import lxml
from lxml import etree
import pandas
xml_src = '''<?xml version="1.0"?>
<stationaer xsi:schemaLocation="http:/foo.bar" xmlns="http://foo.bar" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<einrichtung>
<name>Name</name>
</einrichtung>
<einrichtung>
<name>Name</name>
</einrichtung>
</stationaer>
'''
# tree = etree.parse(file_path)
# root = tree.getroot()
root = etree.fromstring(xml_src)
print(repr(root.tag))
print(repr(root.text))
child = root.getchildren()[0]
print(repr(child.tag))
print(repr(child.text))
The output for root is
'{http://foo.bar}stationaer'
'\n '
and for child
'{http://foo.bar}einrichtung'
'\n '
I don't understand what's going on here and why that URL is in the output.

This is actually not unexpected. The elements in the XML document are bound to the http://foo.bar default namespace. The namespace is declared by xmlns="http://foo.bar" on the root element and the declaration is inherited by all descendants.
The special notation with the namespace URI enclosed in curly braces ({http://foo.bar}stationaer) is never used in XML documents, but it is used by lxml and ElementTree when printing element (tag) names. It can also be used when searching or creating elements that belong to a namespace.
More information:
https://www.w3.org/TR/xml-names/
https://lxml.de/tutorial.html#namespaces
https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces

Replace contents of lxml StringElement with starting a namespace war?

I can't figure out how to replace contents of lxml StringElement (styleUrl in this case) which already has a namespace (other than pytype). I end up getting an element level namespace injected. This is a much distilled and simplified version that only tries to rename one StyleMap to illustrate the issue:
#!/usr/bin/env python
from __future__ import print_function
import sys
from pykml import parser as kmlparser
from lxml import objectify
frm = "lineStyle30218901714341461519022"
to = "s1"
b4_et = kmlparser.parse('b4.kml')
b4_root = b4_et
el = b4_root.xpath('//*[#id="%s"]' % frm)[0]
el.attrib['id'] = to
el = b4_root.xpath('//*[text()="#%s"]' % frm)[0]
el.xpath('./..')[0].styleUrl = '#'+to
objectify.deannotate(b4_root, xsi_nil=True)
b4_et.write(sys.stdout, pretty_print=True)
test data:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
<Document>
<name>Wasatch Trails</name>
<open>1</open>
<Style id="lineStyle30218901714341461519049">
<LineStyle><color>ff0080ff</color><width>4</width></LineStyle>
</Style>
<Style id="lineStyle30218901714341461519027">
<LineStyle><color>ff0080ff</color><width>4</width></LineStyle>
</Style>
<StyleMap id="lineStyle30218901714341461519022">
<Pair><key>normal</key><styleUrl>#lineStyle30218901714341461519049</styleUrl></Pair>
<Pair><key>highlight</key><styleUrl>#lineStyle30218901714341461519027</styleUrl></Pair>
</StyleMap>
<Placemark>
<name>Trail</name>
<styleUrl>#lineStyle30218901714341461519022</styleUrl>
<LineString>
<tessellate>1</tessellate>
<coordinates>
-111.6472637672589,40.4810633294269,0 -111.650415221546,40.48116138407261,0 -111.6504410181637,40.48118694372887,0
</coordinates>
</LineString>
</Placemark>
</Document>
</kml>
The only issue I have not been able to resolve is lxml putting a xmlns:py="http://codespeak.net/lxml/objectify/pytype" attribute into the newly created styleUrl element. I'm guessing this is caused by the document having a default namespace for kml/2.2. I don't know how to tell it the new styleUrl should be kml instead of pytype.
...
<styleUrl xmlns:py="http://codespeak.net/lxml/objectify/pytype">#s1</styleUrl>
...

Replacing following:
el.xpath('./..')[0].styleUrl = '#'+to
with:
el.xpath('./..')[0].styleUrl = objectify.StringElement('#' + to)
will give you what you want. But I'm not sure whether this is the best way.
BTW, you can use set(key, value) method to set attribute value:
el.set('id', to) # isntead of el.attrib['id'] = to

is there a way to get the attribute text directly in a xml without traversing through children in elementree in python

I am using the python module : xml.etree.ElementTree for parsing xml files.
I am curious to know if there is a way to directly find an attribute that is nested deeply.
For example if I want to get the name attribute of neigbhor (if it exists),
I need to traverse through country/rank/year/gdppc, if my root is data. Is there a quick way to look up that attribute?
<data>
<country name="Liechtenstein">
<rank>
<year>
<gdppc>
<neighbor name="Austria" direction="E"/>
</gdppc>
</year>
</rank>
</country>
</data>
EDIT:
I tried something on this line. but did not help; I am not sure if I should be using resp.content for the xml retrived
resp=requests.get(url_fetch,params=query)
with open(resp.content) as fd:
doc = ElementTree.parse(fd)
name = doc.find('PubmedArticle//Volume').text
print name
here is the xml:

Depending on what your data looks like and exactly what you're trying to accomplish, you could do something like this:
with open('data.xml') as fd:
doc = ElementTree.parse(fd)
name = doc.find('country[#name="Liechtenstein"]//neighbor').get('name')
print name
Which given the input above would yield:
Austria
If you're parsing XML with Python, you may want to look at the lxml module, which has full support for XPath queries.
This works for me with the URL you gave above:
#!/usr/bin/python
import requests
from xml.etree import ElementTree
res = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=24059499&retmode=xml')
doc = ElementTree.fromstring(res.content)
ele = doc.find('.//PubmedArticle//Volume')
print ele.text

Error in escaping XML for a KML file

Some time ago I asked a question trying to figure out why modifying a KML file increased the file size.
After poking around, I've found that the issue had to do with escaping XML.
Essentially, the "<", ">", and "&" characters were being replaced with:
"<", ">", and "&"
It's not a big deal for smaller files, but the extra characters make a big difference in larger files.
I copied some code from this site to help solve the problem:
import lxml
from lxml import etree
import pykml
from pykml.factory import KML_ElementMaker as KML
from pykml import parser
def unescape(s):
s = s.replace("<", "<")
s = s.replace(">", ">")
## Ampersands must be last to avoid errors in text replacement
s = s.replace("&", "&")
return s
with open("myplaces.kml", "rb") as f:
doc = parser.parse(f).getroot()
a = doc.Document.Folder[0].Folder[1]
for q in GEList:
x = KML.Folder(KML.name(q))
a.append(x)
finished = (etree.tostring(doc, pretty_print = True))
finished = unescape(finished)
with open("myplaces.kml", "wb") as f:
f.write(finished)
Now however, I'm running into another error. I compared the file before and after I replaced the <, >, and & characters.
Before: <description><![CDATA[<img src="fedland_leg_pop_2.jpg" alt="headerimg" width="550" height="77"><br>
After: <description><img src="fedland_leg_pop_2.jpg" alt="headerimg" width="550" height="77"><br>
Now it seems to be throwing out "< ![CDATA[", & I can't figure out why.

I had the same issue but then I found this (https://developers.google.com/kml/documentation/kml_tut#descriptive_html):
Using the CDATA Element
If you want to write standard HTML inside a tag, you can put it inside a CDATA tag. If you don't, the angle brackets need to be written as entity references to prevent Google Earth from parsing the HTML incorrectly (for example, the symbol > is written as > and the symbol < is written as <). This is a standard feature of XML and is not unique to Google Earth.
Consider the difference between HTML markup with CDATA tags and without CDATA. First, here's the with CDATA tags:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>CDATA example</name>
<description>
<![CDATA[
<h1>CDATA Tags are useful!</h1>
<p><font color="red">Text is <i>more readable</i> and
<b>easier to write</b> when you can avoid using entity
references.</font></p>
]]>
</description>
<Point>
<coordinates>102.595626,14.996729</coordinates>
</Point>
</Placemark>
</Document>
</kml>
And here's the without CDATA tags, so that special characters must use entity references:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name>Entity references example</name>
<description>
<h1>Entity references are hard to type!</h1>
<p><font color="green">Text is
<i>more readable</i>
and <b>easier to write</b>
when you can avoid using entity references.</font></p>
</description>
<Point>
<coordinates>102.594411,14.998518</coordinates>
</Point>
</Placemark>
</Document>
</kml>

How to resolve external entities with xml.etree like lxml.etree

I have a script that parses XML using lxml.etree:
from lxml import etree
parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
tree = etree.parse('main.xml', parser=parser)
I need load_dtd=True and resolve_entities=True be have &emptyEntry; from globals.xml resolved:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE map SYSTEM "globals.xml" [
<!ENTITY dirData "${DATADIR}">
]>
<map
xmlns:map="http://my.dummy.org/map"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsschemaLocation="http://my.dummy.org/map main.xsd"
>
&emptyEntry; <!-- from globals.xml -->
<entry><key>KEY</key><value>VALUE</value></entry>
<entry><key>KEY</key><value>VALUE</value></entry>
</map>
with globals.xml
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY emptyEntry "<entry></entry>">
Now I would like to move from non-standard lxml to standard xml.etree. But this fails with my file because the load_dtd=True and resolve_entities=True is not supported by xml.etree.
Is there an xml.etree-way to have these entities resolved?

My trick is to use the external program xmllint
proc = subprocess.Popen(['xmllint','--noent',fname],stdout=subprocess.PIPE)
output = proc.communicate()[0]
tree = ElementTree.parse(StringIO.StringIO(output))

lxml is a right tool for the job.
But, if you want to use stdlib, then be prepared for difficulties and take a look at XMLParser's UseForeignDTD method. Here's a good (but hacky) example: Python ElementTree support for parsing unknown XML entities?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parse XML with (X)HTML entities - python

I was having a similar issue and got around it by using lxml. Its etree.XMLParser has a recover keyword argument which forces it to try to ignore broken XML.

Related

Unexpected results when parsing XML via lxml

Replace contents of lxml StringElement with starting a namespace war?

is there a way to get the attribute text directly in a xml without traversing through children in elementree in python

Error in escaping XML for a KML file

How to resolve external entities with xml.etree like lxml.etree

Categories

Resources