Python error with decode utf-8 and Japanese characters

Traceback (most recent call last):
File "C:\Program Files (x86)\Python\Projects\test.py", line 70, in <module>
html = urlopen("https://www.google.co.jp/").read().decode('utf-8')
File "C:\Program Files (x86)\Python\lib\http\client.py", line 506, in read
return self._readall_chunked()
File "C:\Program Files (x86)\Python\lib\http\client.py", line 592, in _readall_chunked
value.append(self._safe_read(chunk_left))
File "C:\Program Files (x86)\Python\lib\http\client.py", line 664, in _safe_read
raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(5034 bytes read, 3158 more expected)
So I am trying to get data from a website, but it seems that whenever it comes across Japanese characters or other non-ASCII characters it raises this error. All I am using is urlopen and .read().decode('utf-8'). Is there some way I can just ignore all of them, or replace them all, so there is no error?

In the code you posted, there is no problem with character encoding. Instead you have a problem getting the whole HTTP response. (Look closely at the error message.)
I tried this in an interactive Python shell:
>>> import urllib2
>>> url = urllib2.urlopen("https://www.google.co.jp/")
>>> body = url.read()
>>> len(body)
11155
This worked.
>>> body.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x90 in position 102: invalid start byte
Ok, there is indeed an encoding error.
>>> url.headers['Content-Type']
'text/html; charset=Shift_JIS'
This is because your HTTP response is not encoded in UTF-8, but in Shift-JIS.
You should probably not use urllib2 but a higher level library that takes care of the HTTP encoding. Or, if you want to do it yourself, see https://stackoverflow.com/a/20714761.
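If you do want to handle it yourself, here is a minimal sketch (Python 2, and assuming the server sends a charset parameter in the Content-Type header, as it does here):
import urllib2

url = urllib2.urlopen("https://www.google.co.jp/")
body = url.read()

# The headers object can parse the charset parameter out of
# 'text/html; charset=Shift_JIS' for us; fall back to UTF-8 if missing.
charset = url.headers.getparam('charset') or 'utf-8'
text = body.decode(charset)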

Use requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.google.co.jp/")
soup = BeautifulSoup(r.content)
print soup.find_all("p")
[<p style="color:#767676;font-size:8pt">© 2013 - プライバシーと利用規約</p>]
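As an aside, requests can also decode the body itself: r.text is the body decoded using the charset declared in the Content-Type header (Shift_JIS in this case), while r.content is the raw bytes:
import requests

r = requests.get("https://www.google.co.jp/")
print r.encoding   # charset from the Content-Type header, e.g. 'Shift_JIS'
html = r.text      # body already decoded to unicode using that charset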


python, vobject, encoding, vcards

I am using vobject in python. I am attempting to parse the vcard located here:
http://www.mayerbrown.com/people/vCard.aspx?Attorney=1150
To do this, I do the following:
import urllib
import vobject
vcard = urllib.urlopen("http://www.mayerbrown.com/people/vCard.aspx?Attorney=1150").read()
vcard_object = vobject.readOne(vcard)
Whenever I do this, I get the following error:
Traceback (most recent call last):
File "<pyshell#86>", line 1, in <module>
vobject.readOne(urllib.urlopen("http://www.mayerbrown.com/people/vCard.aspx?Attorney=1150").read())
File "C:\Python27\lib\site-packages\vobject-0.8.1c-py2.7.egg\vobject\base.py", line 1078, in readOne
ignoreUnreadable, allowQP).next()
File "C:\Python27\lib\site-packages\vobject-0.8.1c-py2.7.egg\vobject\base.py", line 1031, in readComponents
vline = textLineToContentLine(line, n)
File "C:\Python27\lib\site-packages\vobject-0.8.1c-py2.7.egg\vobject\base.py", line 888, in textLineToContentLine
return ContentLine(*parseLine(text, n), **{'encoded':True, 'lineNumber' : n})
File "C:\Python27\lib\site-packages\vobject-0.8.1c-py2.7.egg\vobject\base.py", line 262, in __init__
self.value = str(self.value).decode('quoted-printable')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 29: ordinal not in range(128)
I have tried a number of other variations on this, such as converting vcard into unicode, using various encodings, etc. But I always get the same, or a very similar, error message.
Any ideas on how to fix this?
It's failing on line 13 of the vCard because the ADR property is incorrectly marked as being encoded in the "quoted-printable" encoding. The ü character should be encoded as =FC, which is why vobject is throwing the error.
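For reference, this is what quoted-printable encoding of that character looks like (a quick check with Python 2's quopri module):
import quopri

# A Latin-1 u-umlaut is the single byte 0xFC; quoted-printable writes it as '=FC'.
print quopri.encodestring(u'ü'.encode('latin-1'))  # -> '=FC'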
The file is downloaded as a UTF-8 encoded string (I think), but the library tries to interpret it as ASCII. Try adding the following line after the urlopen call:
vcard = vcard.decode('utf-8')
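In context, the whole fetch looks like this (a minimal sketch; it assumes the server really does serve UTF-8):
import urllib
import vobject

# Read the raw bytes, then decode them to unicode before parsing.
vcard = urllib.urlopen("http://www.mayerbrown.com/people/vCard.aspx?Attorney=1150").read()
vcard = vcard.decode('utf-8')
vcard_object = vobject.readOne(vcard)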
The vobject library's readOne method is pretty awkward.
To avoid problems I decided to persist the vcards in my database as quoted-printable data, which readOne accepts happily.
import quopri

# assuming some_vcard is a string with UTF-8 encoding
quopried_vcard = quopri.encodestring(some_vcard)
The quopried_vcard gets persisted, and when needed:
vobj = vobject.readOne(quopried_vcard)
Then, to get the decoded data back, e.g. for the fn field of the vcard:
quopri.decodestring(vobj.fn.value)
Maybe somebody can handle UTF-8 with readOne better; if so, I would love to see it.

How to handle UnicodeDecodeError without losing any data?

I am using Python & lxml and am stuck with an error
My code
>>>import urllib
>>>from lxml import html
>>>response = urllib.urlopen('http://www.edmunds.com/dealerships/Texas/Grapevine/GrapevineFordLincoln_1/fullservice-505318162.html').read()
>>>dom = html.fromstring(response)
>>>dom.xpath("//div[@class='description item vcard']")[0].xpath(".//p[@class='service-review-paragraph loose-spacing']")[0].text_content()
Traceback
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/lxml/html/__init__.py", line 249, in text_content
return _collect_string_content(self)
File "xpath.pxi", line 466, in lxml.etree.XPath.__call__ (src/lxml/lxml.etree.c:119105)
File "xpath.pxi", line 242, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:116936)
File "extensions.pxi", line 552, in lxml.etree._unwrapXPathObject (src/lxml/lxml.etree.c:112473)
File "apihelpers.pxi", line 1344, in lxml.etree.funicode (src/lxml/lxml.etree.c:21864)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 477: invalid start byte
The problem is the special character which is present in the div I am fetching. How can I encode/decode the text without losing any data?
The parser assumes this is a UTF-8 file, but it's not. The simplest thing to do would be to convert it to unicode first, by finding out the encoding of the page:
>>> url = urllib.urlopen('http://www.edmunds.com/dealerships/Texas/Grapevine/GrapevineFordLincoln_1/fullservice-505318162.html')
>>> url.headers.get('content-type')
'text/html; charset=ISO-8859-1'
>>> response = url.read()
#let's convert to unicode first
>>> response_unicode = codecs.decode(response, 'ISO-8859-1')
>>> dom = html.fromstring(response_unicode)
#and now...
>>> dom.xpath("//div[@class='description item vcard']")[0].xpath(".//p[@class='service-review-paragraph loose-spacing']")[0].text_content()
u'\n On December 5th, my vehicle completely shut down.\nI had it towed to Grapevine Ford where they told me that the intak.....
tada!
So it looks like the page is corrupted. It has UTF-8 encoding specified, but is not valid in that encoding.
urlopen(...).read() returns us a byte string (str). When you feed it to lxml, it tries to decode it with UTF-8 and fails.
This might not be the best way, but we can specify a different encoding manually, such as Latin-1:
response = urllib.urlopen(...).read().decode('latin-1')
Now response is a text string (unicode), and that's what LXML wants to work with.
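A note on why Latin-1 never raises here: it maps every possible byte 0x00-0xFF to a code point, so decoding always succeeds and the round trip is lossless:
# Bytes like \x93 that are invalid UTF-8 still decode under Latin-1,
# and re-encoding restores the original bytes exactly.
raw = '\x93curly quotes\x94'
text = raw.decode('latin-1')
assert text.encode('latin-1') == raw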

Beautiful Soup raises UnicodeEncodeError "ordinal not in range(128)"

I am trying to parse arbitrary documents downloaded from the wild web, and yes, I have no control over their content.
Since Beautiful Soup won't choke if you give it bad markup... I wonder why it gives me these hiccups when part of the doc is malformed, and whether there is a way to make it resume at the next readable portion of the doc, regardless of this error.
The line where the error occurred is the 3rd one:
from BeautifulSoup import BeautifulSoup as doc_parser
reader = open(options.input_file, "rb")
doc = doc_parser(reader)
CLI full output is:
Traceback (most recent call last):
File "./grablinks", line 101, in <module>
sys.exit(main())
File "./grablinks", line 88, in main
links = grab_links(options)
File "./grablinks", line 36, in grab_links
doc = doc_parser(reader)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1519, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1144, in __init__
self._feed(isHTML=isHTML)
File "/usr/local/lib/python2.7/dist-packages/BeautifulSoup.py", line 1186, in _feed
SGMLParser.feed(self, markup)
File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
self.goahead(0)
File "/usr/lib/python2.7/sgmllib.py", line 143, in goahead
k = self.parse_endtag(i)
File "/usr/lib/python2.7/sgmllib.py", line 320, in parse_endtag
self.finish_endtag(tag)
File "/usr/lib/python2.7/sgmllib.py", line 358, in finish_endtag
method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)
Yeah, it will choke if you have elements with non-ASCII names (<café>). And that's not even ‘bad markup’, for XML...
It's a bug in sgmllib, which BeautifulSoup is using: it tries to find custom methods with the same names as tags, but in Python 2 method names are byte strings, so even looking up a method with a non-ASCII character in its name (which will never be present) fails.
You can hack a fix into sgmllib by changing lines 259 and 371 from except AttributeError: to except AttributeError, UnicodeError: but that's not really a good fix. It's not trivial to override the rest of the method either.
What is it you're trying to parse? BeautifulStoneSoup was always of questionable usefulness, really: XML doesn't have the wealth of ghastly parser hacks that HTML does, so in general broken XML isn't XML. Consequently you should generally use a plain old XML parser (e.g. a standard DOM or etree). For parsing general HTML, html5lib is your better option these days.
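For instance, a minimal sketch with html5lib (assuming it is installed); it parses markup like this without raising:
import html5lib

# Non-ASCII tag names choke sgmllib-based parsers; html5lib copes.
broken = u'<p>before <café>middle</café> after</p>'
tree = html5lib.parse(broken, treebuilder='etree')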
This happens if there are non-ASCII chars in the input, in Python versions before 3.0.
If you try to use str(...) on a string containing chars with a value of 128 or greater (ANSI & Unicode), this exception is raised.
Here, the error probably occurs because getattr tries to use str on a unicode string: it "thinks" it can safely do this because in Python versions prior to 3.0, identifiers cannot contain non-ASCII characters.
Check your HTML for unicode characters. Try to replace / encode these, and if it still does not work, tell us.
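For example, one crude way to do that, assuming the input is meant to be UTF-8 (and lossy if it contains invalid bytes): decode it, then re-encode every non-ASCII character as an HTML character reference, leaving pure ASCII for sgmllib:
# Invalid bytes become U+FFFD, then everything non-ASCII becomes an
# &#NNNN; reference, so the parser only ever sees ASCII.
raw = open(options.input_file, "rb").read()
cleaned = raw.decode('utf-8', 'replace').encode('ascii', 'xmlcharrefreplace')
doc = doc_parser(cleaned)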

How to Parse HTML with Non-ASCII Characters using BeautifulSoup?

I keep getting the following error when trying to parse some html using BeautifulSoup:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
I've tried decoding the html using the solutions to the questions below, but I keep getting the same error. None of them work (I'm posting them so that I don't get duplicate answers, and in case they help anyone find a solution by viewing related approaches to the problem).
Anybody know where I'm going wrong here? Is this a bug in BeautifulSoup and should I install an earlier version?
EDIT: code and traceback below:
from BeautifulSoup import BeautifulSoup as bs
soup = bs(html)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 1282, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 946, in __init__
self._feed()
File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 971, in _feed
SGMLParser.feed(self, markup)
File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
self.goahead(0)
File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
Thanks for your help!
'ascii' codec error in beautifulsoup
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
How do I convert a file's format from Unicode to ASCII using Python?
python UnicodeEncodeError > How can I simply remove troubling unicode characters?
You say in a comment: """I just looked up the content-type of the html I'm trying to parse to see if it was something I hadn't tried (earlier I just assumed it was UTF-8) but sure enough it was UTF-8 so another dead end."""
Sigh. This is exactly why I have been trying to get you to divulge the HTML that you are trying to parse. The error message indicates that the (first) problem byte is \xae which is definitely NOT a valid lead byte in a UTF-8 sequence.
Either divulge the link to your HTML, or do some basic debugging:
Does uc = html.decode('utf8') work or fail? If fail, with what error message?
You also said: """I'm starting to think this is a bug in BS, which they allude to in the docs, and can be seen here: crummy.com/software/BeautifulSoup/CHANGELOG.html."""
I can't imagine which of the vague entries in the changelog you are referring to. Consider debugging your problem before you rush to update.
Update: Looks like an obscure bug in sgmllib.py. In line 394, change 255 to 127 and it appears to work. Corner case: an HTML char ref (&#174;) in an attribute value AND with 128 <= ordinal < 255.
Further comments: Rather than hack your copy of sgmllib.py, grab a copy of the latest sgmllib.py from the 2.7 branch; BS 3.0.4 ran OK for me on Python 2.7.1. Even better, upgrade your Python to 2.7.
I tried to use pyquery on the html and the result is fine.
import urllib
from pyquery import PyQuery
html = urllib.urlopen('http://www.6pm.com/onitsuka-tiger-by-asics-ultimate-81').read()
pq = PyQuery(html)
print pq('span#price').text() # "$39.00 40% off MSRP $65.00"
pyquery is based on lxml so it's also much faster than beautifulsoup.

UnicodeDecodeError is raised when getting a cookie in Google App Engine

I have a GAE project in Python where I am setting a cookie in one of my RequestHandlers with this code:
self.response.headers['Set-Cookie'] = 'app=ABCD; expires=Fri, 31-Dec-2020 23:59:59 GMT'
I checked in Chrome and I can see the cookie listed, so it appears to be working.
Then later in another RequestHandler, I get the cookie to check it:
appCookie = self.request.cookies['app']
This line gives the following error when executed:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)
It seems that it is trying to decode the incoming cookie info using an ASCII codec rather than UTF-8.
How do I force Python to use UTF-8 to decode this?
Are there any other Unicode-related gotchas that I need to be aware of as a newbie to Python and Google App Engine (but an experienced programmer in other languages)?
Here is the full Traceback:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4144, in _HandleRequest
self._Dispatch(dispatcher, self.rfile, outfile, env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 4049, in _Dispatch
base_env_dict=env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 616, in Dispatch
base_env_dict=base_env_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3120, in Dispatch
self._module_dict)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 3024, in ExecuteCGI
reset_modules = exec_script(handler_path, cgi_path, hook)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/tools/dev_appserver.py", line 2887, in ExecuteOrImportScript
exec module_code in script_module.__dict__
File "/Users/ken/hgdev/juicekit/main.py", line 402, in <module>
main()
File "/Users/ken/hgdev/juicekit/main.py", line 399, in main
run_wsgi_app(application)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 98, in run_wsgi_app
run_bare_wsgi_app(add_wsgi_middleware(application))
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/util.py", line 116, in run_bare_wsgi_app
result = application(env, _start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 721, in __call__
response.wsgi_write(start_response)
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 296, in wsgi_write
body = self.out.getvalue()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/StringIO.py", line 270, in getvalue
self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1962: ordinal not in range(128)
You're looking to use the decode function, somewhat like this (credit: @agf):
self.request.cookies['app'].decode('utf-8')
From official python documentation (plus a couple added details):
Python’s 8-bit strings have a .decode([encoding], [errors]) method that interprets the string using the given encoding. The following example shows the string as it goes to unicode and then back to an 8-bit string:
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
>>> type(u), u # Examine
(<type 'unicode'>, u'\ua000abcd\u07b4')
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
>>> type(utf8_version), utf8_version # Examine
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
>>> u == u2 # The two strings match
True
First, encode any unicode value you set in the cookies. You also need to quote them in case they can break the header:
import urllib
# This is the value we want to set.
initial_value = u'äëïöü'
# WebOb version that comes with SDK doesn't quote cookie values
# in the Response, neither webapp.Response. So we have to do it.
quoted_value = urllib.quote(initial_value.encode('utf-8'))
rsp = webapp.Response()
rsp.headers['Set-Cookie'] = 'app=%s; Path=/' % quoted_value
Now let's read the value back. To test it, create a fake Request carrying the cookie we have set. This code was extracted from a real unittest:
cookie = rsp.headers.get('Set-Cookie')
req = webapp.Request.blank('/', headers=[('Cookie', cookie)])
# The stored value is the same quoted value from before.
# Notice that here we use .str_cookies, not .cookies.
stored_value = req.str_cookies.get('app')
self.assertEqual(stored_value, quoted_value)
Our value is still encoded and quoted. We must do the reverse to get the initial one:
# And we can get the initial value unquoting and decoding.
final_value = urllib.unquote(stored_value).decode('utf-8')
self.assertEqual(final_value, initial_value)
If you can, consider using webapp2. webob.Response does all the hard work of quoting and setting cookies, and you can set unicode values directly. See a summary of these issues here.
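For what it's worth, a minimal sketch of the webapp2 version (the handler name here is made up; set_cookie comes from webob.Response and handles the quoting and encoding for you):
import webapp2

# Hypothetical handler, just to show where the calls go.
class CookieHandler(webapp2.RequestHandler):
    def get(self):
        # webob quotes and encodes the value, so unicode is fine here.
        self.response.set_cookie('app', u'äëïöü', path='/')
        # Reading it back yields the decoded unicode value.
        value = self.request.cookies.get('app')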
