Why am I getting a UnicodeDecodeError in Python's JSON encoding? - python

I am using Solr 3.3 to index records from my database. I compose the JSON content in Python. I can upload 2126 records, which add up to 523246 chars (approx. 511 KB). But when I try 2127 records, Python gives me the error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "D:\Technovia\db_indexer\solr_update.py", line 69, in upload_service_details
request_string.append(param_list)
File "C:\Python27\lib\json\__init__.py", line 238, in dumps
**kw).encode(obj)
File "C:\Python27\lib\json\encoder.py", line 203, in encode
chunks = list(chunks)
File "C:\Python27\lib\json\encoder.py", line 425, in _iterencode
for chunk in _iterencode_list(o, _current_indent_level):
File "C:\Python27\lib\json\encoder.py", line 326, in _iterencode_list
for chunk in chunks:
File "C:\Python27\lib\json\encoder.py", line 384, in _iterencode_dict
yield _encoder(value)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 68: invalid start byte
Ouch. Is 512kb worth of bytes a fundamental limit? Is there any high-volume alternative to the existing JSON module?
Update: it's a fault in some of the data, as trying to encode *biz_list[2126:]* causes an immediate error. Here is the offending piece:
'2nd Floor, Gurumadhavendra Towers,\nKadavanthra Road, Kaloor,\nCochin \x96 682 017'
How can I convert it so that it can be encoded as JSON?
Update 2: The answer worked as expected: the data came from a MySQL table encoded in "latin-1-swedish-ci". I saw significance in a random number. Sorry for spontaneously channeling the spirit of a headline writer when diagnosing the fault.

Simple: don't decode as UTF-8 if your data is not in UTF-8.
>>> json.loads('["\x96"]')
....
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 0: invalid start byte
>>> json.loads('["\x96"]', encoding="latin-1")
[u'\x96']
From the json.loads documentation:
If s is a str instance and is encoded with an ASCII based
encoding other than utf-8 (e.g. latin-1) then an appropriate
encoding name must be specified. Encodings that are not ASCII
based (such as UCS-2) are not allowed and should be decoded to
unicode first.
Edit: To get the proper unicode value of "\x96", use "cp1252", as Eli Collins mentioned:
>>> json.loads('["\x96"]', encoding="cp1252")
[u'\u2013']
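The question is about json.dumps rather than json.loads, so another option is to decode the raw bytes to text before serializing. The following is a minimal sketch, written for Python 3 and assuming the column bytes really are cp1252 (which is what MySQL's "latin1" is based on). The variable names and the 'address' key are illustrative, not taken from the asker's code.

```python
import json

# Raw bytes as they might come out of a latin1/cp1252 column (value from the question).
raw = b'2nd Floor, Gurumadhavendra Towers,\nKadavanthra Road, Kaloor,\nCochin \x96 682 017'

# Decode with the encoding the data is actually in; 0x96 is an en dash in cp1252.
address = raw.decode('cp1252')

# json.dumps now receives text, so no implicit UTF-8 decode is attempted.
payload = json.dumps([{'address': address}])
print(payload)
```

With the default ensure_ascii=True, the en dash is written as a \u2013 escape, so the payload itself stays pure ASCII.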

Related

UnicodeDecodeError while cursor.fetchall()

I was trying to export some data from Google Cloud SQL database to an excel file using Python xlsxwriter , webapp2, appengine in a deferred task.
The data to be written has to be retrieved from the database.
The query is executing fine but when I try to fetch the data from the query either using cursor.fetchall() or by iterating over cursor it is throwing the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9f in position 4: invalid start byte
The stacktrace is :
for row in cursor:
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 689, in fetchone
self._FetchMoreRows()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 606, in _FetchMoreRows
self._DoExec(request)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 448, in _DoExec
return self._HandleResult(response.result)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 487, in _HandleResult
new_rows = self._GetRows(result)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 532, in _GetRows
tuple_proto.values[value_index]))
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/rdbms.py", line 402, in _DecodeVariable
return converter(value)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/storage/speckle/python/api/converters.py", line 126, in Str2Unicode
return unicode(arg, 'utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9f in position 4: invalid start byte
The same code works if I try to locally run it using MySQLdb instead of rdbms.
There could be some encoding issue in data but that should come up while writing to the file.
I tried finding some data that may be corrupt but was not able to find any.
You do not have text that is encoded in utf8. What is it encoded in? If it is latin1, then the 9F stands for Ÿ; would that make sense? Find the line with the hex 9f; let's see the context.
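One way to find the value containing the stray byte is to try decoding each raw value and report where it fails. This is only a sketch for Python 3: find_undecodable and the sample rows are hypothetical, and it assumes you can get the column values as raw bytes (for instance through a driver that does not decode them for you).

```python
def find_undecodable(values, encoding='utf-8', context=10):
    """Yield (index, byte offset, surrounding bytes) for values that fail to decode."""
    for i, value in enumerate(values):
        try:
            value.decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first undecodable byte.
            start = max(exc.start - context, 0)
            yield i, exc.start, value[start:exc.start + context]

# Hypothetical sample rows; the second one holds the byte 0x9f at position 4.
rows = [b'plain ascii', b'name\x9f here', u'caf\u00e9'.encode('utf-8')]
bad = list(find_undecodable(rows))
print(bad)
```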

Python ignores encoding argument in favor of cp1252

I have a lengthy json file that contains utf-8 characters (and is encoded in utf-8). I want to read it in python using the built-in json module.
My code looks like this:
dat = json.load(open("data.json"), "utf-8")
I understand the "utf-8" argument should be unnecessary, since it is assumed as the default. However, I get this error:
Traceback (most recent call last):
File "winratio.py", line 9, in <module>
dat = json.load(open("data.json"), "utf-8")
File "C:\Python33\lib\json\__init__.py", line 271, in load
return loads(fp.read(),
File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 28519: character maps to <undefined>
My question is: Why does python seem to ignore my encoding specification and try to load the file in cp1252?
Try this:
import codecs
dat = json.load(codecs.open("data.json", "r", "utf-8"))
For tips on writing files with the codecs library, see also: Write to UTF-8 file in Python
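The traceback shows the failure inside fp.read(), in cp1252.py, so the bytes are decoded by the file object before the json module looks at them. In Python 3, open() in text mode uses the locale's preferred encoding unless one is given, which is often cp1252 on Windows. A minimal sketch of naming the encoding at open time follows; the temporary file is only a stand-in for data.json.

```python
import io
import json
import os
import tempfile

# Write a small UTF-8 JSON file to read back (stand-in for data.json).
path = os.path.join(tempfile.mkdtemp(), 'data.json')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'{"name": "Preu\u00dfen M\u00fcnster"}')

# Name the encoding when opening the file; json.load then receives decoded text.
with io.open(path, 'r', encoding='utf-8') as f:
    dat = json.load(f)
```

io.open is the same function as the built-in open in Python 3, and it also exists in Python 2.6+, so this form should work across both.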

How to Parse HTML with Non-ASCII Characters using BeautifulSoup?

I keep getting the following error when trying to parse some html using BeautifulSoup:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
I've tried decoding the HTML using the solutions to the questions listed below, but I keep getting the same error. (I'm listing them so that I don't get duplicate answers, and in case the related approaches help anyone find a solution.)
Anybody know where I'm going wrong here? Is this a bug in BeautifulSoup and should I install an earlier version?
EDIT: code and traceback below:
from BeautifulSoup import BeautifulSoup as bs
soup = bs(html)
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 1282, in __init__
BeautifulStoneSoup.__init__(self, *args, **kwargs)
File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 946, in __init__
self._feed()
File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 971, in _feed
SGMLParser.feed(self, markup)
File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
self.goahead(0)
File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
Thanks for your help!
'ascii' codec error in beautifulsoup
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
How do I convert a file's format from Unicode to ASCII using Python?
python UnicodeEncodeError > How can I simply remove troubling unicode characters?
You say in a comment: """I just looked up the content-type of the html I'm trying to parse to see if it was something I hadn't tried (earlier I just assumed it was UTF-8) but sure enough it was UTF-8 so another dead end."""
Sigh. This is exactly why I have been trying to get you to divulge the HTML that you are trying to parse. The error message indicates that the (first) problem byte is \xae which is definitely NOT a valid lead byte in a UTF-8 sequence.
Either divulge the link to your HTML, or do some basic debugging:
Does uc = html.decode('utf8') work or fail? If fail, with what error message?
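The check suggested above might look like the following in Python 3. The html value here is a made-up stand-in containing the byte 0xae from the error message, not the asker's actual page.

```python
# Stand-in for the fetched page: bytes containing 0xae, as in the error message.
html = b'<a title="\xae brand">link</a>'

try:
    uc = html.decode('utf8')
    result = 'decoded as UTF-8'
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte that is not valid UTF-8.
    result = 'not UTF-8: byte %r at offset %d' % (html[exc.start:exc.start + 1], exc.start)

print(result)
```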
You also said: """I'm starting to think this is a bug in BS, which they allude to in the docs, and can be seen here: crummy.com/software/BeautifulSoup/CHANGELOG.html."""
I can't imagine which of the vague entries in the changelog you are referring to. Consider debugging your problem before you rush to update.
Update: This looks like an obscure bug in sgmllib.py. In line 394, change 255 to 127 and it appears to work. Corner case: an HTML character reference (®) in an attribute value AND with 128 <= ordinal < 255.
Further comments: Rather than hack your copy of sgmllib.py, grab a copy of the latest sgmllib.py from the 2.7 branch; BS 3.0.4 ran OK for me on Python 2.7.1. Even better, upgrade your Python to 2.7.
I tried to use pyquery on the html and the result is fine.
import urllib
from pyquery import PyQuery
html = urllib.urlopen('http://www.6pm.com/onitsuka-tiger-by-asics-ultimate-81').read()
pq = PyQuery(html)
print pq('span#price').text() # "$39.00 40% off MSRP $65.00"
pyquery is based on lxml so it's also much faster than beautifulsoup.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)

I want to parse my XML documents, which I have stored as below:
class XMLdocs(db.Expando):
    id = db.IntegerProperty()
    name = db.StringProperty()
    content = db.BlobProperty()
Below is my code:
parser = make_parser()
curHandler = BasketBallHandler()
parser.setContentHandler(curHandler)
for q in XMLdocs.all():
    parser.parse(StringIO.StringIO(q.content))
I am getting the error below:
'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 517, in __call__
handler.post(*groups)
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/base_handler.py", line 59, in post
self.handle()
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 168, in handle
scan_aborted = not self.process_entity(entity, ctx)
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 233, in process_entity
handler(entity)
File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 71, in process
parser.parse(StringIO.StringIO(q.content))
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 136, in characters
print ch
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
The actual best answer for this problem depends on your environment, specifically what encoding your terminal expects.
The quickest one-line solution is to encode everything you print to ASCII, which your terminal is almost certain to accept, while discarding characters that you cannot print:
print ch #fails
print ch.encode('ascii', 'ignore')
The better solution is to change your terminal's encoding to utf-8, and encode everything as utf-8 before printing. You should get in the habit of thinking about your unicode encoding EVERY time you print or read a string.
Just appending .encode('utf-8') to the object will do the job in recent versions of Python.
It seems you are hitting a UTF-8 byte order mark (BOM). Try using this unicode string with BOM extracted out:
import codecs
content = unicode(q.content.strip(codecs.BOM_UTF8), 'utf-8')
parser.parse(StringIO.StringIO(content))
I used strip instead of lstrip because in your case you had multiple occurrences of the BOM, possibly due to concatenated file contents.
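Note that strip() treats its argument as a set of individual bytes and only touches the two ends of the string, so a BOM in the middle of concatenated content would survive it. The sketch below, written for Python 3 bytes, compares two alternatives: the 'utf-8-sig' codec, which drops a single leading BOM, and replacing the exact three-byte sequence everywhere. The sample content is hypothetical.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes b'\xef\xbb\xbf'

# Hypothetical content: two BOM-prefixed documents concatenated together.
content = BOM + b'<a>x</a>' + BOM + b'<b>y</b>'

# 'utf-8-sig' drops only a single leading BOM while decoding.
leading_only = content.decode('utf-8-sig')

# Removing the exact three-byte sequence everywhere handles the concatenated case.
cleaned = content.replace(BOM, b'').decode('utf-8')
```

Here leading_only still contains one U+FEFF character in the middle, while cleaned contains none.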
This worked for me:
from django.utils.encoding import smart_str
content = smart_str(content)
The problem according to your traceback is the print statement on line 136 of parseXML.py. Unfortunately you didn't see fit to post that part of your code, but I'm going to guess it is just there for debugging. If you change it to:
print repr(ch)
then you should at least see what you are trying to print.
The problem is that you're trying to print a unicode character to a possibly non-unicode terminal. You need to encode it with the 'replace' option before printing it, e.g. print ch.encode(sys.stdout.encoding, 'replace').
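As a rough illustration of the 'replace' approach, here is a small helper that also copes with sys.stdout.encoding being unset. The name printable is made up for this sketch, and the ASCII fallback is an assumption rather than anything the answers prescribe.

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # sys.stdout.encoding may be None (for example under Python 2 when output
    # is piped), so fall back to ASCII in that case.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, 'replace').decode(encoding)

print(printable(u'\xef', 'ascii'))
print(printable(u'na\xefve', 'ascii'))
```

Unlike the 'ignore' option, 'replace' leaves a visible '?' wherever a character was lost, which makes dropped characters easier to spot while debugging.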
An easy way to overcome this problem is to set your default encoding to utf8. The following is an example:
import sys
reload(sys)
sys.setdefaultencoding('utf8')

UnicodeDecodeError reading string in CSV

I'm having a problem reading some chars in python.
I have a CSV file in UTF-8 format, and when the script reads this value:
Preußen Münster-Kaiserslautern II
I get this error:
Traceback (most recent call last):
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 515, in __call__
handler.get(*groups)
File "/Users/fermin/project/gae/cuotastats/controllers/controllers.py", line 50, in get
f.name = unicode( row[1])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
I tried to use Unicode functions and convert string to Unicode, but I haven't found the solution. I tried to use sys.setdefaultencoding('utf8') but that doesn't work either.
Try the unicode_csv_reader() generator described in the csv module docs.
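The unicode_csv_reader() recipe in the Python 2 docs decodes every cell from UTF-8 after the csv module has split the row. In the question's code the same effect for a single cell would be unicode(row[1], 'utf-8'), since unicode(row[1]) alone falls back to the ASCII codec. Under Python 3 the csv module works on text directly, so a sketch of the equivalent is to decode while reading. The in-memory data below stands in for the real file.

```python
import csv
import io

# Stand-in for the UTF-8 CSV file: one row containing the team name from the question.
data = u'1,Preu\u00dfen M\u00fcnster-Kaiserslautern II\n'.encode('utf-8')

# Decode while reading, so every cell the csv module yields is already text.
with io.TextIOWrapper(io.BytesIO(data), encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
```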
