How can I get the page language (xml:lang="") using the lxml library?

I'm very new to the lxml library and so far find it confusing to parse anything but links.
I've read the documentation, but I'm struggling to get the value of the xml:lang=".." attribute from the top-level <html> tag.
How can I read that value?
Example: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-GB" lang="en">

>>> import lxml.html
>>> s = '''<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-GB" lang="en"></html>'''
>>> root = lxml.html.fromstring(s)
>>> root.get('xml:lang')
'en-GB'
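This works because lxml's HTML parser does no namespace processing, so xml:lang is just a literal attribute name. If you instead parse the document as XML (a small sketch using lxml.etree rather than lxml.html), the attribute is namespaced and has to be read with Clark notation:

import lxml.etree

s = '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-GB" lang="en"></html>'
root = lxml.etree.fromstring(s)
# In an XML parse, xml:lang lives in the reserved XML namespace.
XML_NS = 'http://www.w3.org/XML/1998/namespace'
print(root.get('{%s}lang' % XML_NS))  # prints: en-GB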

Related

parse html using Python's "xml" module: ParseError on meta tag

I'm trying to parse some HTML using the xml Python library. The HTML I'm trying to parse is from download.docker.com, which renders as:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Index of linux/ubuntu/dists/jammy/pool/stable/amd64/</title>
</head>
<body>
<h1>Index of linux/ubuntu/dists/jammy/pool/stable/amd64/</h1>
<hr>
<pre>../
containerd.io_1.5.10-1_amd64.deb
...
</pre><hr></body></html>
Parsing the html with the following code,
import urllib.request
import xml.etree.ElementTree as ET
html_doc = urllib.request.urlopen(<MY_URL>).read()
root = ET.fromstring(html_doc)
raises:
xml.etree.ElementTree.ParseError: mismatched tag: line 6, column 2
Unless I'm mistaken, this is because of the <meta charset="UTF-8"> tag. Using something like lxml, I can make this work with:
import urllib.request
from lxml import html
html_doc = urllib.request.urlopen(<MY_URL>).read()
root = html.fromstring(html_doc)
Is there any way to parse this html using the xml python library instead of lxml?
Is there any way to parse this html using the xml python library instead of lxml?
The answer is no.
An XML library (for example xml.etree.ElementTree) cannot be used to parse arbitrary HTML. It can be used to parse HTML that also happens to be well-formed XML. But your HTML document is not well-formed.
lxml on the other hand can be used for both XML and HTML.
By the way, note that "the xml python library" is ambiguous. There are several submodules in the xml package in the standard library (https://docs.python.org/3/library/xml.html). All of them will reject the HTML document in the question.
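To see the failure in isolation, here is a minimal sketch: the unclosed <meta> tag is a void element in HTML, which is legal HTML but not well-formed XML, so ElementTree rejects it, while the same markup self-closed (XHTML style) parses fine.

import xml.etree.ElementTree as ET

# Fails: <meta> is never closed, so this is not well-formed XML.
try:
    ET.fromstring('<head><meta charset="UTF-8"></head>')
except ET.ParseError as e:
    print('ParseError:', e)  # mismatched tag

# Succeeds: the same markup written as well-formed XML.
ET.fromstring('<head><meta charset="UTF-8"/></head>')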

How to prevent lxml from adding a default doctype

lxml seems to add a default doctype when one is missing from the HTML document.
See this demo code:
import lxml.etree
import lxml.html
def beautify(html):
    parser = lxml.etree.HTMLParser(
        strip_cdata=True,
        remove_blank_text=True
    )
    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo
    return lxml.etree.tostring(
        d,
        pretty_print=True,
        doctype=docinfo.doctype,
        encoding='utf8'
    )
with_doctype = """
<!DOCTYPE html>
<html>
<head>
<title>With Doctype</title>
</head>
</html>
"""
# This passes!
assert "DOCTYPE" in beautify(with_doctype)
no_doctype = """<html>
<head>
<title>No Doctype</title>
</head>
</html>"""
# This fails!
assert "DOCTYPE" not in beautify(no_doctype)
# because the returned html contains this line
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# which was not present in the source before
How can I tell lxml to not do this?
This issue was originally raised here:
https://github.com/mitmproxy/mitmproxy/issues/845
Quoting a comment on reddit, as it might be helpful:
lxml is based on libxml2, which does this by default unless you pass the option HTML_PARSE_NODEFDTD, I believe. Code here.
I don't know if you can tell lxml to pass that option, though. libxml has Python bindings that you could perhaps use directly, but they seem really hairy.
EDIT: did some more digging, and that option does appear in the lxml source here. That option does exactly what you want, but I'm not sure how to activate it yet, if it's even possible.
There is currently no way to do this in lxml, but I've created a Pull Request on lxml which adds a default_doctype boolean to the HTMLParser.
Once the code gets merged in, the parser needs to be created like so:
parser = lxml.etree.HTMLParser(
    strip_cdata=True,
    remove_blank_text=True,
    default_doctype=False,
)
Everything else stays the same.
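For what it's worth, that option has since landed in lxml, so on a current release the parser can be created exactly as shown above. On an older lxml without default_doctype, one possible workaround (a sketch, not from the pull request) is to serialize a doctype only when the source itself declared one:

def beautify(html):
    parser = lxml.etree.HTMLParser(
        strip_cdata=True,
        remove_blank_text=True
    )
    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo
    # Only pass the doctype through if the input actually had one,
    # so the libxml2 default is never written out.
    had_doctype = html.lstrip().lower().startswith('<!doctype')
    return lxml.etree.tostring(
        d,
        pretty_print=True,
        doctype=docinfo.doctype if had_doctype else None,
        encoding='utf8'
    )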

Parse XHTML5 with undefined entities

Please consider this:
import xml.etree.ElementTree as ET
xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p>&nbsp;Sample text</p>
</body>
</html>
'''
parser = ET.XMLParser()
parser.entity['nbsp'] = ' '
tree = ET.fromstring(xhtml, parser=parser)
print(ET.tostring(tree, method='xml'))
which renders a nice text representation of the XHTML string.
But for the same XHTML document with an HTML5 doctype:
xhtml = '''<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p>&nbsp;Sample text</p>
</body>
</html>
'''
I get Exception:
xml.etree.ElementTree.ParseError: undefined entity: line 5, column 19
so the parser can't handle it, even though I added nbsp to the entity dict.
Same happens if I use lxml:
from lxml import etree
parser = etree.XMLParser(resolve_entities=False)
tree = etree.fromstring(xhtml, parser=parser)
print(etree.tostring(tree, method='xml'))
raises:
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 5, column 26
although I've set the parser to ignore entities.
Why is this, and how can I make parsing XHTML files with an HTML5 doctype declaration possible?
A partial solution for lxml is to use the parser's recovery mode:
parser = etree.XMLParser(resolve_entities=False, recover=True)
but I'm still waiting for a better one.
The problem here is that the Expat parser used behind the scenes won't normally report unknown entities; it throws an error instead, so the fallback code in xml.etree.ElementTree that you were trying to trigger never runs. You can use the UseForeignDTD method to change this behavior: it makes Expat ignore the doctype declaration and hand entity resolution over to xml.etree.ElementTree. The following code works correctly:
import xml.etree.ElementTree as ET
xhtml = '''<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p>&nbsp;Sample text</p>
</body>
</html>
'''
parser = ET.XMLParser()
parser._parser.UseForeignDTD(True)
parser.entity['nbsp'] = u'\u00A0'
tree = ET.fromstring(xhtml, parser=parser)
print(ET.tostring(tree, method='xml'))
The side effect of this approach: as I said, the doctype declaration is completely ignored. This means that you have to declare all entities, even the ones supposedly covered by the doctype.
Note that the values you put into the ElementTree.XMLParser.entity dictionary have to be regular strings, i.e. the text the entity will be replaced by; you can no longer refer to other entities there. So it should be u'\u00A0' for &nbsp;.
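An lxml-side alternative (a sketch, not part of the answer above): since the bare HTML5 doctype declares no entities, you can declare &nbsp; yourself in an internal DTD subset before parsing, and the XML parser will then resolve it on its own:

from lxml import etree

# Inject an internal DTD subset defining the one entity we need
# ('\xa0' is a non-breaking space).
xhtml_fixed = xhtml.replace(
    '<!DOCTYPE html>',
    '<!DOCTYPE html [<!ENTITY nbsp "\xa0">]>',
    1)
tree = etree.fromstring(xhtml_fixed)
print(etree.tostring(tree, method='xml'))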

HTML encoding and lxml parsing

I'm trying to finally solve some encoding issues that pop up from trying to scrape HTML with lxml. Here are three sample HTML documents that I've encountered:
1.
<!DOCTYPE html>
<html lang='en'>
<head>
<title>Unicode Chars: 은 —’</title>
<meta charset='utf-8'>
</head>
<body></body>
</html>
2.
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
<title>Unicode Chars: 은 —’</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>
3.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Unicode Chars: 은 —’</title>
</head>
<body></body>
</html>
My basic script:
from lxml.html import fromstring
...
doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print title
The results are:
Unicode Chars: ì ââ
Unicode Chars: 은 —’
Unicode Chars: 은 —’
So there's obviously an issue with sample 1 and its missing <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. The solution from here will correctly recognize sample 1 as utf-8, so it is functionally equivalent to my original code.
The lxml docs appear conflicted:
From here, the example seems to suggest we should use UnicodeDammit to decode the markup to unicode.
from BeautifulSoup import UnicodeDammit
import lxml.html

def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode

root = lxml.html.fromstring(decode_html(tag_soup))
However here it says:
[Y]ou will get errors when you try [to parse] HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.
If I try to follow the first suggestion in the lxml docs, my code is now:
from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print title
I now get the following results:
Unicode Chars: 은 —’
Unicode Chars: 은 —’
ValueError: Unicode strings with encoding declaration are not supported.
Sample 1 now works correctly but sample 3 results in an error due to the <?xml version="1.0" encoding="utf-8"?> tag.
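That ValueError is easy to reproduce in isolation (a minimal sketch):

from lxml.html import fromstring

# lxml refuses unicode input that still carries an encoding declaration,
# because the declaration no longer matches the already-decoded data.
try:
    fromstring(u'<?xml version="1.0" encoding="utf-8"?><html/>')
except ValueError as e:
    print(e)  # Unicode strings with encoding declaration are not supported.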
Is there a correct way to handle all of these cases? Is there a better solution than the following?
dammit = UnicodeDammit(raw_html)
try:
    doc = fromstring(dammit.unicode_markup)
except ValueError:
    doc = fromstring(raw_html)
lxml has several issues related to handling Unicode. It might be best to use bytes (for now) while specifying the character encoding explicitly:
#!/usr/bin/env python
import glob
from lxml import html
from bs4 import UnicodeDammit
for filename in glob.glob('*.html'):
    with open(filename, 'rb') as file:
        content = file.read()
    doc = UnicodeDammit(content, is_html=True)
    parser = html.HTMLParser(encoding=doc.original_encoding)
    root = html.document_fromstring(content, parser=parser)
    title = root.find('.//title').text_content()
    print(title)
Output
Unicode Chars: 은 —’
Unicode Chars: 은 —’
Unicode Chars: 은 —’
The problem probably stems from the fact that <meta charset> is a relatively new standard (HTML5, if I'm not mistaken; it wasn't really used before that).
Until the lxml.html library is updated to reflect it, you will need to handle that case specially.
If you only care about ISO-8859-* and UTF-8, and can afford to throw away non-ASCII-compatible encodings (such as UTF-16 or the traditional East Asian charsets), you can do a regular-expression substitution on the byte string, replacing the newer <meta charset> with the older http-equiv format.
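For instance, a rough sketch (assuming the input is bytes in an ASCII-compatible encoding; the function name is just for illustration):

import re

def downgrade_meta_charset(raw_html):
    # Rewrite HTML5-style <meta charset="..."> into the older
    # http-equiv form that lxml's encoding detection understands.
    return re.sub(
        br'<meta\s+charset=["\']?([\w-]+)["\']?\s*/?>',
        br'<meta http-equiv="Content-Type" content="text/html; charset=\1" />',
        raw_html,
        flags=re.IGNORECASE)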
Otherwise, if you need a proper solution, your best bet is to patch the library yourself (and contribute the fix while you're at it). You might want to ask the lxml developers whether they have any half-baked code lying around for this particular bug, or whether they are tracking it on their bug tracker in the first place.

Python BeautifulSoup double question mark in xml definition

I thought it must be a bug, so I filed a bug report here.
On the other hand, I might be missing something, so I'd like another look at the code.
The problem is that when I initialize BeautifulSoup with the contents of an .xhtml file, the XML declaration gets two question marks at the end.
Can you reproduce the problem? Is there a way to avoid it? Am I missing a function, a method, an argument, or something?
Edit: It's BeautifulSoup 4 on Python 2.x.
The problem:
<?xml version="1.0" encoding="UTF-8"??>
Terminal Output:
>>> from bs4 import BeautifulSoup as bs
>>> with open('example.xhtml', 'r') as f:
... txt = f.read()
... soup = bs(txt)
...
>>> print txt
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
</head>
<body>
</body>
</html>
>>> print soup
<?xml version="1.0" encoding="UTF-8"??>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
</head>
<body>
</body>
</html>
This is a bug. I've committed a fix which will be in the next release of Beautiful Soup.
The root cause:
The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction using the trailing '?' will cause the '?' to be included in data.
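You can watch this happen with the stdlib parser directly (a sketch using the Python 3 module name; on Python 2 the module is called HTMLParser):

from html.parser import HTMLParser

class PIEcho(HTMLParser):
    def handle_pi(self, data):
        # SGML processing instructions end at '>', so the XML-style
        # trailing '?' is kept as part of the data.
        print(repr(data))

PIEcho().feed('<?xml version="1.0" encoding="UTF-8"?>')
# prints: 'xml version="1.0" encoding="UTF-8"?'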
In general, you'll have better results parsing XHTML with the "xml" parser, as ThiefMaster suggested.
Consider using an XML parser:
soup = bs(txt, 'xml')
Using the contents of your txt variable in an example.xhtml file, I can't reproduce the issue with Python 2.7 and the corresponding BeautifulSoup module (not bs4). It works fine for me.
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> print soup
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
</head>
<body>
</body>
</html>
What is the issue you are encountering with it, and what is your end aim? Maybe then someone can suggest a workaround.
