Parse XHTML5 with undefined entities

Parse XHTML5 with undefined entities - python

Please consider this:
import xml.etree.ElementTree as ET
xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p> Sample text</p>
</body>
</html>
'''
parser = ET.XMLParser()
parser.entity['nbsp'] = ' '
tree = ET.fromstring(xhtml, parser=parser)
print(ET.tostring(tree, method='xml'))
which renders nice text representation of xhtml string.
But, for same XHTML document with HTML5 doctype:
xhtml = '''<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p> Sample text</p>
</body>
</html>
'''
I get Exception:
xml.etree.ElementTree.ParseError: undefined entity: line 5, column 19
so the parser can't handle it, although I added nbsp to entities dict.
Same happens if I use lxml:
from lxml import etree
parser = etree.XMLParser(resolve_entities=False)
tree = etree.fromstring(xhtml, parser=parser)
print etree.tostring(tree, method='xml')
raises:
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 5, column 26
although I've set the parser to ignore entities.
Why is this, and how to make parsing of XHTML files with HTML5 doctype declaration possible?
Partial solution for lxml is to use recoverer:
parser = etree.XMLParser(resolve_entities=False, recover=True)
but I'm still waiting for better one.

The problem here is, the Expat parser used behind the scenes won't usually report unknown entities - it will rather throw an error, so the fallback code in xml.etree.ElementTree you were trying to trigger won't even run. You can use the UseForeignDTD method to change this behavior, it will make Expat ignore the doctype declaration and pass all entity declarations to xml.etree.ElementTree. The following code works correctly:
import xml.etree.ElementTree as ET
xhtml = '''<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p> Sample text</p>
</body>
</html>
'''
parser = ET.XMLParser()
parser._parser.UseForeignDTD(True)
parser.entity['nbsp'] = u'\u00A0'
tree = ET.fromstring(xhtml, parser=parser)
print(ET.tostring(tree, method='xml'))
The side-effect of this approach: as I said, the doctype declaration is completely ignored. This means that you have to declare all entities, even the ones supposedly covered by the doctype.
Note that the values you put into ElementTree.XMLParser.entity dictionary have to be regular strings, text that the entity will be replaced by - you can no longer refer to other entities there. So it should be u'\u00A0' for .

Related

How to prevent lxml from adding a default doctype

lxml seems to add a default doctype when one is missing in the html document.
See this demo code:
import lxml.etree
import lxml.html
def beautify(html):
parser = lxml.etree.HTMLParser(
strip_cdata=True,
remove_blank_text=True
)
d = lxml.html.fromstring(html, parser=parser)
docinfo = d.getroottree().docinfo
return lxml.etree.tostring(
d,
pretty_print=True,
doctype=docinfo.doctype,
encoding='utf8'
)
with_doctype = """
<!DOCTYPE html>
<html>
<head>
<title>With Doctype</title>
</head>
</html>
"""
# This passes!
assert "DOCTYPE" in beautify(with_doctype)
no_doctype = """<html>
<head>
<title>No Doctype</title>
</head>
</html>"""
# This fails!
assert "DOCTYPE" not in beautify(no_doctype)
# because the returned html contains this line
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# which was not present in the source before
How can I tell lxml to not do this?
This issue was originally raised here:
https://github.com/mitmproxy/mitmproxy/issues/845
Quoting a comment on reddit as it might be helpful:
lxml is based on libxml2, which does this by default unless you pass the option HTML_PARSE_NODEFDTD, I believe. Code here.
I don't know if you can tell lxml to pass that option though.. libxml has python bindings that you could perhaps use directly but they seem really hairy.
EDIT: did some more digging and that option does appear in the lxml soure here. That option does exactly what you want but I'm not sure how to activate it yet, if it's even possible.

There is currently no way to do this in lxml, but I've created a Pull Request on lxml which adds a default_doctype boolean to the HTMLParser.
Once the code gets merged in, the parser needs to be created like so:
parser = lxml.etree.HTMLParser(
strip_cdata=True,
remove_blank_text=True,
default_doctype=False,
)
Everything else stays the same.

How can I get Page Language (xml:lang="") using lxml library?

I'm very new to lxml library and find it very confusing to parse anything but links for the moment.
I read the documents but I'm struggling to get the xml:lang=".." attribute's value from the top <html ..> tag.
How can I read that value?
Example: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-GB" lang="en">

>>> import lxml.html
>>> s = '''<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-GB" lang="en"></html>'''
>>> root = lxml.html.fromstring(s)
>>> root.get('xml:lang')
'en-GB'

HTML encoding and lxml parsing

I'm trying to finally solve some encoding issues that pop up from trying to scrape HTML with lxml. Here are three sample HTML documents that I've encountered:
1.
<!DOCTYPE html>
<html lang='en'>
<head>
<title>Unicode Chars: 은 —’</title>
<meta charset='utf-8'>
</head>
<body></body>
</html>
2.
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
<title>Unicode Chars: 은 —’</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>
3.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Unicode Chars: 은 —’</title>
</head>
<body></body>
</html>
My basic script:
from lxml.html import fromstring
...
doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print title
The results are:
Unicode Chars: ì ââ
Unicode Chars: 은 —’
Unicode Chars: 은 —’
So, obviously an issue with sample 1 and the missing <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. The solution from here will correctly recognize sample 1 as utf-8 and so it is functionally equivalent to my original code.
The lxml docs appear conflicted:
From here the example seems to suggest we should use UnicodeDammit to encode the markup as unicode.
from BeautifulSoup import UnicodeDammit
def decode_html(html_string):
converted = UnicodeDammit(html_string, isHTML=True)
if not converted.unicode:
raise UnicodeDecodeError(
"Failed to detect encoding, tried [%s]",
', '.join(converted.triedEncodings))
# print converted.originalEncoding
return converted.unicode
root = lxml.html.fromstring(decode_html(tag_soup))
However here it says:
[Y]ou will get errors when you try [to parse] HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.
If I try to follow the the first suggestion in the lxml docs, my code is now:
from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print title
I now get the following results:
Unicode Chars: 은 —’
Unicode Chars: 은 —’
ValueError: Unicode strings with encoding declaration are not supported.
Sample 1 now works correctly but sample 3 results in an error due to the <?xml version="1.0" encoding="utf-8"?> tag.
Is there a correct way to handle all of these cases? Is there a better solution than the following?
dammit = UnicodeDammit(raw_html)
try:
doc = fromstring(dammit.unicode_markup)
except ValueError:
doc = fromstring(raw_html)

lxml has several issues related to handling Unicode. It might be best to use bytes (for now) while specifying the character encoding explicitly:
#!/usr/bin/env python
import glob
from lxml import html
from bs4 import UnicodeDammit
for filename in glob.glob('*.html'):
with open(filename, 'rb') as file:
content = file.read()
doc = UnicodeDammit(content, is_html=True)
parser = html.HTMLParser(encoding=doc.original_encoding)
root = html.document_fromstring(content, parser=parser)
title = root.find('.//title').text_content()
print(title)
Output
Unicode Chars: 은 —’
Unicode Chars: 은 —’
Unicode Chars: 은 —’

The problem probably stems from the fact that <meta charset> is a relatively new standard (HTML5 if I'm not mistaken, or it wasn't really used before it.)
Until such a time when the lxml.html library is updated to reflect it, you will need to handle that case specially.
If you only care about ISO-8859-* and UTF-8, and can afford to throw away non-ASCII compatible encodings (such as UTF-16 or the East Asian traditional charsets), you can do a regular expression substitution on the byte string, replacing the newer <meta charset> with the older http-equiv format.
Otherwise, if you need a proper solution, your best bet is to patch the library yourself (and contributing the fix while you're at it.) You might want to ask the lxml developers if they have any half-baked code laying around for this particular bug, or if they are tracking the bug on their bug tracking system in the first place.

Python BeautifulSoup double question mark in xml definition

I thought it must be a bug so i issued a bug report here.
On the other hand i might be missing something so i need another look on the code.
The problem is, when i initialize BeautifulSoup with contents of an .xhtml file, xml definition gets two question marks at the end of it.
Can you reproduce the problem? Is there a way to avoid it? Am i missing a function, a method, an argument or something?
Edit0: It's BeautifulSoup 4 on python 2.x.
Edit1: Why downvote?
The problem:
<?xml version="1.0" encoding="UTF-8"??>
Terminal Output:
>>> from bs4 import BeautifulSoup as bs
>>> with open('example.xhtml', 'r') as f:
... txt = f.read()
... soup = bs(txt)
...
>>> print txt
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
</head>
<body>
</body>
</html>
>>> print soup
<?xml version="1.0" encoding="UTF-8"??>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
</head>
<body>
</body>
</html>

This is a bug. I've committed a fix which will be in the next release of Beautiful Soup.
The root cause:
The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing '?' will cause the '?' to be included in data.
In general, you'll have better results parsing XHTML with the "xml" parser, as ThiefMaster suggested.

Consider using an XML parser:
soup = bs(txt, 'xml')

Using your contents of variable in 'txt' in an example.xhtml file I can't reproduce the issue with Python2.7 and the corrosponding BeautifulSoup module (not bs4). Works fine and dandy for me.
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> print soup
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
</head>
<body>
</body>
</html>
What is the issue you are encountering with it, as in what is your end aim, maybe then somoone can suggest a workaround

lxml unicode characters

I'm new to lxml and python. I'm trying to parse an html document. When I parse using the standard xml parser it will write the characters out correctly but I think it fails to parse as I have trouble searching it with xpath.
Example file being parsed:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>title</title>
</head>
<body>
<span id="demo">Garbléd charactérs</span>
</body>
</html>
Parsing code:
from lxml import etree
fname = 'output/so-help.html'
# parse
hparser = etree.HTMLParser()
htree = etree.parse(fname, hparser)
# garbled
htree.write('so-dumpu.html', encoding='utf-8')
# targets
demo_name = htree.xpath("//span[#id='demo']")
# garbled
print 'name: "' + demo_name[0].text
Terminal output:
name: "GarblÃ©d charactÃ©rs
htree.write output:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>title</title></head><body>
<span id="demo">GarblÃ©d charactÃ©rs</span>
</body></html>

the problem was that you tried to encode an already encoded data, what you need is to let
parser decode the data with utf-8.
* in your original code try demo_name[0].text.decode('utf-8') and you will see
the right way to do it :
from lxml import etree
fname = 'output/so-help.html'
# parse
hparser = etree.HTMLParser(encoding='utf-8')
htree = etree.parse(fname, hparser)
# garbled
htree.write('so-dumpu.html')
# targets
demo_name = htree.xpath("//span[#id='demo']")
# garbled
print 'name: "' + demo_name[0].text

Try changing the output encoding:
htree.write('so-dumpu.html', encoding='latin1')
and
print 'name: "' + demo_name[0].text.encode('latin1')

I assume your XHTML document is encoded in utf-8. The issue is that the encoding is not specified in the HTML document. By default, browsers and lxml.html assume HTML documents are encoded in ISO-8859-1, that's why your document is incorrectly parsed. If you open it in your browser, it will also be displayed incorrectly.
You can specify the encoding of your document like this :
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>title</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
You can force the encoding used by lxml this way (like your can change the encoding used in your browser) :
file = open(fname)
filecontents = file.read()
filecontents = filecontents.decode("utf-8")
htree = lxml.html.fromstring(filecontents)
print htree.xpath("//span[#id='demo']")[0].text

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parse XHTML5 with undefined entities - python

Related

How to prevent lxml from adding a default doctype

How can I get Page Language (xml:lang="") using lxml library?

HTML encoding and lxml parsing

Python BeautifulSoup double question mark in xml definition

lxml unicode characters

Categories

Resources