I'm trying to finally solve some encoding issues that pop up from trying to scrape HTML with lxml. Here are three sample HTML documents that I've encountered:
1.
<!DOCTYPE html>
<html lang='en'>
<head>
<title>Unicode Chars: 은 —’</title>
<meta charset='utf-8'>
</head>
<body></body>
</html>
2.
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
<title>Unicode Chars: 은 —’</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>
3.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Unicode Chars: 은 —’</title>
</head>
<body></body>
</html>
My basic script:
from lxml.html import fromstring
...
doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print title
The results are:
Unicode Chars: ì ââ
Unicode Chars: 은 —’
Unicode Chars: 은 —’
So there is obviously an issue with sample 1 and its missing <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. The solution from here will correctly recognize sample 1 as utf-8, so it is functionally equivalent to my original code.
The lxml docs appear conflicted:
From here the example seems to suggest we should use UnicodeDammit to encode the markup as unicode.
import lxml.html
from BeautifulSoup import UnicodeDammit

def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode

root = lxml.html.fromstring(decode_html(tag_soup))
However here it says:
[Y]ou will get errors when you try [to parse] HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.
If I try to follow the first suggestion in the lxml docs, my code is now:
from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print title
I now get the following results:
Unicode Chars: 은 —’
Unicode Chars: 은 —’
ValueError: Unicode strings with encoding declaration are not supported.
Sample 1 now works correctly but sample 3 results in an error due to the <?xml version="1.0" encoding="utf-8"?> tag.
Is there a correct way to handle all of these cases? Is there a better solution than the following?
dammit = UnicodeDammit(raw_html)
try:
    doc = fromstring(dammit.unicode_markup)
except ValueError:
    doc = fromstring(raw_html)
lxml has several issues related to handling Unicode. It might be best to use bytes (for now) while specifying the character encoding explicitly:
#!/usr/bin/env python
import glob
from lxml import html
from bs4 import UnicodeDammit

for filename in glob.glob('*.html'):
    with open(filename, 'rb') as file:
        content = file.read()

    doc = UnicodeDammit(content, is_html=True)
    parser = html.HTMLParser(encoding=doc.original_encoding)
    root = html.document_fromstring(content, parser=parser)
    title = root.find('.//title').text_content()
    print(title)
Output
Unicode Chars: 은 —’
Unicode Chars: 은 —’
Unicode Chars: 은 —’
The problem probably stems from the fact that <meta charset> is a relatively new standard (introduced with HTML5, if I'm not mistaken; at the very least it wasn't widely used before it).
Until the lxml.html library is updated to reflect it, you will need to handle that case specially.
If you only care about ISO-8859-* and UTF-8, and can afford to throw away non-ASCII compatible encodings (such as UTF-16 or the East Asian traditional charsets), you can do a regular expression substitution on the byte string, replacing the newer <meta charset> with the older http-equiv format.
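For illustration, a rough sketch of that substitution might look like the following (convert_meta_charset is a made-up helper name, and the pattern only covers the simple quoted/unquoted forms of the tag):

import re

# Matches the HTML5-style declaration, e.g. <meta charset="utf-8"> or <meta charset=utf-8/>.
META_CHARSET_RE = re.compile(
    br"""<meta\s+charset=["']?([\w-]+)["']?\s*/?>""",
    re.IGNORECASE)

def convert_meta_charset(raw_html_bytes):
    # Rewrite <meta charset="..."> into the older http-equiv form so that
    # parsers which only understand the old form still see the declared charset.
    # Only safe for ASCII-compatible encodings such as UTF-8 and ISO-8859-*.
    return META_CHARSET_RE.sub(
        br'<meta http-equiv="Content-Type" content="text/html; charset=\1"/>',
        raw_html_bytes)

You would then pass the converted bytes to the parser as before, e.g. fromstring(convert_meta_charset(raw_html)).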
Otherwise, if you need a proper solution, your best bet is to patch the library yourself (and contributing the fix while you're at it.) You might want to ask the lxml developers if they have any half-baked code laying around for this particular bug, or if they are tracking the bug on their bug tracking system in the first place.
Related
I am trying to render some basic umlauts with jinja2.
test.html
<!doctype html>
<link type="text/css" rel="stylesheet" href="style.css"/>
<meta charset="UTF-8">
<h3>Umlauts: ä ü ö</h3>
Result.html
<!doctype html>
<link type="text/css" rel="stylesheet" href="style.css"/>
<meta charset="UTF-8">
<h3>Umlauts: ä ü ö</h3>
My code
from jinja2 import Template
file = open("test.html")
data = file.read()
Template(data).stream().dump("index.html")
Now I don't understand how to get Jinja to process the umlauts correctly. How can I do this? I am using stream because in my actual use case I am providing some data to fill in and then dumping it to an HTML file to be displayed.
EDIT: Is what I want even possible? As I understand it from here, it is not:
It is not possible to use Jinja2 to process non-Unicode data. The
reason for this is that Jinja2 uses Unicode already on the language
level. For example Jinja2 treats the non-breaking space as valid
whitespace inside expressions which requires knowledge of the encoding
or operating on an Unicode string.
With Python 3 you can specify the encoding with open:
from jinja2 import Template
file = open("test.html", 'r', encoding='utf-8')
data = file.read()
Template(data).stream().dump('index.html')
For Python 2 you can use the io module to specify the encoding:
import io
file = io.open("test.html", 'r', encoding='utf-8')
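A minimal Python 2 sketch putting those pieces together (the explicit encoding argument to dump() is optional and shown here only to make the output encoding obvious):

# -*- coding: utf-8 -*-
import io
from jinja2 import Template

# Read the template as Unicode so Jinja2 never sees raw bytes.
with io.open("test.html", "r", encoding="utf-8") as f:
    data = f.read()

# Render and write the result back out as UTF-8.
Template(data).stream().dump("index.html", encoding="utf-8")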
I am using BeautifulSoup to scrape data from a Chinese online publishing website, and this is the URL to one of the novels http://www.jjwxc.net/onebook.php?novelid=1485737.
I have tried different encoding and decoding schemes (e.g., gb2312, utf-8) and their combinations to read the website. For example
import requests
from bs4 import BeautifulSoup
url = "http://www.jjwxc.net/onebook.php?novelid=1485737"
response = requests.get(url)
text = response.text
print text.encode('gb2312')
>> UnicodeEncodeError: 'gb2312' codec can't encode character u'\xa1' in position 340: illegal multibyte sequence
print text.encode('utf-8')
>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" />
<title>¡¶£¨Õý°æ£©±¼Ô¡·Êñ¿Í_¡¾Ô´´Ð¡Ëµ|ÑÔÇéС˵¡¿_½ú½ÎÄѧ³Ç</title>
<meta name="Keywords" content="Êñ¿Í,£¨Õý°æ£©±¼ÔÂ,Êñ¿Í¡¶£¨Õý°æ£©±¼Ô¡·,Ö÷½Ç£ºÁøÉÒ ©§ Åä½Ç£ºÔ£¬Â½À룬ËÕÐÅ£¬°×ÒÂÚÄÇ£¬Âå¸è£¬×¿ÇïÏÒ£¬ÉÌÓñÈÝ£¬Ð»ÁîÆëµÈµÈ£¨³ö³¡ÅÅÃû£© ©§ ÆäËü£ºÏÉÏÀ£¬ÁøÉÒ£¬ÔÂÉñ£¬Éñ»°,ÇéÓжÀÖÓ Å°ÁµÇéÉî ÁéÒìÉñ¹Ö âêÈ»Èôʧ ×îиüÐÂ:2015-07-15 23:57:04 ×÷Æ·»ý·Ö£º193191456" />
Note that the document itself claims to be encoded using gb2312.
I took a tour around the forum and realized that there may be some problems in the encoding declaration. If I try the following:
import urllib2
html = urllib2.urlopen('http://www.jjwxc.net/onebook.php?novelid=1485737').read()
soup = BeautifulSoup(html)
soup.original_encoding
>> {windows-1252}
But
import chardet
chardet.detect(html)
gives
>> {'confidence': 0.0, 'encoding': None}
Can someone shine some light onto this problem? Thank you!
I used the method mentioned in how to decode and encode web page with python?, and found that it worked with most Chinese websites but not the one that I am interested in.
Try this; it should do the job.
Python's GB-family codecs (gb2312, gbk, gb18030) provide conversion to and from these Chinese encodings, with gb18030 being the most complete superset.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import requests
from bs4 import BeautifulSoup
url = "http://www.jjwxc.net/onebook.php?novelid=1485737"
response = requests.get(url)
text = response.text
text = text.decode('gbk').encode('utf-8')
print text
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312"/>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" />
<title>隆露拢篓脮媒掳忙拢漏卤录脭脗隆路脢帽驴脥_隆戮脭颅麓麓脨隆脣碌|脩脭脟茅脨隆脣碌隆驴_陆煤陆颅脦脛脩搂鲁脟</title>
...
...
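If the setdefaultencoding hack feels fragile, an alternative sketch (assuming the page really is GB-family encoded, as its meta tag claims) is to tell requests which encoding to use before reading response.text, or to decode the raw bytes yourself; gb18030 is used here because it is a superset of gb2312 and gbk:

import requests
from bs4 import BeautifulSoup

url = "http://www.jjwxc.net/onebook.php?novelid=1485737"
response = requests.get(url)

# requests guesses the encoding from the HTTP headers, which can be wrong;
# override it with the declared GB-family encoding before using response.text.
response.encoding = 'gb18030'
soup = BeautifulSoup(response.text, 'html.parser')
print soup.title.string

# Equivalent: decode the raw bytes directly.
text = response.content.decode('gb18030')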
Please consider this:
import xml.etree.ElementTree as ET
xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p>&nbsp;Sample text</p>
</body>
</html>
'''
parser = ET.XMLParser()
parser.entity['nbsp'] = '&#160;'
tree = ET.fromstring(xhtml, parser=parser)
print(ET.tostring(tree, method='xml'))
which renders a nice text representation of the XHTML string.
But for the same XHTML document with an HTML5 doctype:
xhtml = '''<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p>&nbsp;Sample text</p>
</body>
</html>
'''
I get Exception:
xml.etree.ElementTree.ParseError: undefined entity: line 5, column 19
so the parser can't handle it, although I added nbsp to the entities dict.
Same happens if I use lxml:
from lxml import etree
parser = etree.XMLParser(resolve_entities=False)
tree = etree.fromstring(xhtml, parser=parser)
print etree.tostring(tree, method='xml')
raises:
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 5, column 26
although I've set the parser to ignore entities.
Why is this, and how can I make parsing of XHTML files with an HTML5 doctype declaration possible?
A partial solution for lxml is to use its recovery mode:
parser = etree.XMLParser(resolve_entities=False, recover=True)
but I'm still waiting for a better one.
The problem here is that the Expat parser used behind the scenes won't usually report unknown entities; it will throw an error instead, so the fallback code in xml.etree.ElementTree you were trying to trigger won't even run. You can use the UseForeignDTD method to change this behavior: it makes Expat ignore the doctype declaration and pass all unknown entity references through to xml.etree.ElementTree. The following code works correctly:
import xml.etree.ElementTree as ET
xhtml = '''<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head><title>XHTML sample</title></head>
<body>
<p>&nbsp;Sample text</p>
</body>
</html>
'''
parser = ET.XMLParser()
parser._parser.UseForeignDTD(True)
parser.entity['nbsp'] = u'\u00A0'
tree = ET.fromstring(xhtml, parser=parser)
print(ET.tostring(tree, method='xml'))
The side effect of this approach: as I said, the doctype declaration is completely ignored. This means you have to declare all entities yourself, even the ones supposedly covered by the doctype.
Note that the values you put into the ElementTree.XMLParser.entity dictionary have to be regular strings, i.e. the text that the entity will be replaced by; you can no longer refer to other entities there. So it should be u'\u00A0' for &nbsp;.
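If you need more than a handful of entities, a small sketch like the following (building the dictionary from the standard htmlentitydefs / html.entities table instead of typing each one out) may be more convenient; treat it as an illustration rather than the only way:

import xml.etree.ElementTree as ET

try:
    from htmlentitydefs import name2codepoint   # Python 2
    to_char = unichr
except ImportError:
    from html.entities import name2codepoint    # Python 3
    to_char = chr

parser = ET.XMLParser()
parser._parser.UseForeignDTD(True)

# Register every standard HTML entity, mapping each name to its character,
# since the ignored doctype no longer declares any of them.
for name, codepoint in name2codepoint.items():
    parser.entity[name] = to_char(codepoint)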
I thought it must be a bug, so I issued a bug report here.
On the other hand, I might be missing something, so I need another look at the code.
The problem is that when I initialize BeautifulSoup with the contents of an .xhtml file, the XML declaration gets two question marks at the end of it.
Can you reproduce the problem? Is there a way to avoid it? Am I missing a function, a method, an argument or something?
Edit0: It's BeautifulSoup 4 on Python 2.x.
Edit1: Why downvote?
The problem:
<?xml version="1.0" encoding="UTF-8"??>
Terminal Output:
>>> from bs4 import BeautifulSoup as bs
>>> with open('example.xhtml', 'r') as f:
... txt = f.read()
... soup = bs(txt)
...
>>> print txt
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
</head>
<body>
</body>
</html>
>>> print soup
<?xml version="1.0" encoding="UTF-8"??>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
</head>
<body>
</body>
</html>
This is a bug. I've committed a fix which will be in the next release of Beautiful Soup.
The root cause:
The HTMLParser class uses the SGML syntactic rules for processing instructions, where there is no trailing '?' before the closing '>'. As a result, an XML/XHTML processing instruction written as <?...?> has its trailing '?' included in the data.
In general, you'll have better results parsing XHTML with the "xml" parser, as ThiefMaster suggested.
Consider using an XML parser:
soup = bs(txt, 'xml')
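A quick sketch of what that might look like against the same file (assuming lxml is installed, since Beautiful Soup's "xml" parser is backed by lxml):

from bs4 import BeautifulSoup as bs

with open('example.xhtml', 'r') as f:
    # Parse as XML instead of HTML; the XML declaration should then
    # round-trip without the doubled '?'.
    soup = bs(f.read(), 'xml')

print soup.prettify()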
Using the contents of your 'txt' variable in an example.xhtml file, I can't reproduce the issue with Python 2.7 and the corresponding BeautifulSoup module (not bs4). It works fine and dandy for me.
>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> print soup
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
</head>
<body>
</body>
</html>
What is the issue you are encountering with it, as in, what is your end aim? Maybe then someone can suggest a workaround.
I'm new to lxml and Python. I'm trying to parse an HTML document. When I parse using the standard XML parser it writes the characters out correctly, but I think it fails to parse properly, as I have trouble searching it with XPath.
Example file being parsed:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>title</title>
</head>
<body>
<span id="demo">Garbléd charactérs</span>
</body>
</html>
Parsing code:
from lxml import etree
fname = 'output/so-help.html'
# parse
hparser = etree.HTMLParser()
htree = etree.parse(fname, hparser)
# garbled
htree.write('so-dumpu.html', encoding='utf-8')
# targets
demo_name = htree.xpath("//span[@id='demo']")
# garbled
print 'name: "' + demo_name[0].text
Terminal output:
name: "Garbléd charactérs
htree.write output:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>title</title></head><body>
<span id="demo">Garbléd charactérs</span>
</body></html>
The problem was that you tried to encode already-encoded data; what you need is to let the parser decode the data as utf-8.
In your original code, try demo_name[0].text.decode('utf-8') and you will see.
The right way to do it:
from lxml import etree
fname = 'output/so-help.html'
# parse
hparser = etree.HTMLParser(encoding='utf-8')
htree = etree.parse(fname, hparser)
# garbled
htree.write('so-dumpu.html')
# targets
demo_name = htree.xpath("//span[@id='demo']")
# garbled
print 'name: "' + demo_name[0].text
Try changing the output encoding:
htree.write('so-dumpu.html', encoding='latin1')
and
print 'name: "' + demo_name[0].text.encode('latin1')
I assume your XHTML document is encoded in utf-8. The issue is that the encoding is not specified in the HTML document. By default, browsers and lxml.html assume HTML documents are encoded in ISO-8859-1, which is why your document is incorrectly parsed. If you open it in your browser, it will also be displayed incorrectly.
You can specify the encoding of your document like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>title</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
You can force the encoding used by lxml this way (just as you can change the encoding used in your browser):
import lxml.html

file = open(fname)
filecontents = file.read()
filecontents = filecontents.decode("utf-8")
htree = lxml.html.fromstring(filecontents)
print htree.xpath("//span[@id='demo']")[0].text