json.dumps(pickle.dumps(u'å')) raises UnicodeDecodeError - python

Is this a bug?
>>> import json
>>> import cPickle
>>> json.dumps(cPickle.dumps(u'å'))
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/json/__init__.py", line 230, in dumps
return _default_encoder.encode(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/json/encoder.py", line 361, in encode
return encode_basestring_ascii(o)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3: invalid data

The json module is expecting strings to encode text. Pickled data isn't text, it's 8-bit binary.
One simple workaround, if you really need to send pickled data over JSON, is to use base64:
import base64
j = json.dumps(base64.b64encode(cPickle.dumps(u'å')))
cPickle.loads(base64.b64decode(json.loads(j)))
Note that this is very clearly a Python bug. Protocol version 0 is explicitly documented as ASCII, yet å is sent as the non-ASCII byte \xe5 instead of encoding it as "\u00E5". This bug was reported upstream--and the ticket was closed without the bug being fixed. http://bugs.python.org/issue2980
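As a rough illustration of the base64 workaround, here is a minimal sketch in Python 3 syntax (where the module is pickle rather than cPickle, and b64encode returns bytes that must be decoded to text before JSON will accept them). The helper names pickle_to_json and json_to_object are hypothetical, chosen only for this example.

```python
import base64
import json
import pickle


def pickle_to_json(obj):
    """Pickle obj, base64-encode the bytes, and wrap the ASCII text in JSON."""
    raw = pickle.dumps(obj)                       # arbitrary binary data
    text = base64.b64encode(raw).decode("ascii")  # base64 output is pure ASCII
    return json.dumps(text)


def json_to_object(payload):
    """Reverse of pickle_to_json: JSON string -> base64 text -> pickle -> object."""
    text = json.loads(payload)
    return pickle.loads(base64.b64decode(text))


restored = json_to_object(pickle_to_json(u"\xe5"))  # u"\xe5" is 'å'
```

Only unpickle data you trust; the JSON wrapper does nothing to make pickle safe for untrusted input.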

This could be a bug in pickle. The Python documentation says of the pickle format used here: Protocol version 0 is the original ASCII protocol and is backwards compatible with earlier versions of Python. [...] If a protocol is not specified, protocol 0 is used.
>>> cPickle.dumps(u'å').decode('ascii')
Traceback (most recent call last):
File "", line 1, in
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 1: ordinal not in range(128)
That is not ASCII.
Also, I don't know whether it's relevant, or even a problem:
>>> cPickle.dumps(u'å') == pickle.dumps(u'å')
False

I'm using Python2.6 and your code runs without any error.
In [1]: import json
In [2]: import cPickle
In [3]: json.dumps(cPickle.dumps(u'å'))
Out[3]: '"V\\u00e5\\np1\\n."'
BTW, what's your system default encoding? In my case, it's:
In [6]: sys.getdefaultencoding()
Out[6]: 'ascii'

Related

Python 2.7 String decode failed.

I expected the following code to work, but it fails. What is the reason?
>>> s = 'ö'
>>> s.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 0: invalid start byte
In the interactive interpreter, the encoding of a string literal depends entirely on your terminal or console configuration. In your case, that is not set to UTF-8.
You can use the sys.stdin.encoding attribute to determine what codec to use:
>>> s = 'ö'
>>> import sys
>>> s.decode(sys.stdin.encoding)
u'\xf6'
Alternatively, just create a unicode string literal (using the u prefix) directly; the Python interactive interpreter knows to use the sys.stdin.encoding codec for that case:
>>> s = u'ö'
>>> s
u'\xf6'
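To make the point about console encodings concrete, here is a small sketch in Python 3 syntax. It assumes the console in the question used code page 437, which is a guess based on the byte 0x94 in the traceback and the Windows path; other code pages also map 0x94 to 'ö'.

```python
# The single byte the console produced for 'ö' (taken from the traceback).
raw = b"\x94"

# Decoding with the codec the console actually used recovers the character.
as_cp437 = raw.decode("cp437")

# Decoding the same byte as UTF-8 fails, because 0x94 cannot start a
# UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Once decoded, the text can be re-encoded in any codec that contains 'ö'.
as_utf8_bytes = as_cp437.encode("utf-8")
```

The same bytes mean different characters under different codecs, so the decode call has to name the codec that produced them.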

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 2: ordinal not in range(128)

I installed pywikibot-core (version 2.0b3) for my MediaWiki installation. I got an error when I tried to run a command that contains Unicode text.
I run the following command:
python pwb.py replace.py -regex -start:! "\[মুয়ায্যম হুসায়ন খান\]" "[মুয়ায্‌যম হুসায়ন খান]" -summary:"fix: মুয়ায্যম > মুয়ায্‌যম"
Here is the error I got:
Traceback (most recent call last):
File "pwb.py", line 161, in <module>
import pywikibot # noqa
File "/var/www/html/banglapedia_bn/core/pywikibot/__init__.py", line 32, in <module>
from pywikibot import config2 as config
File "/var/www/html/banglapedia_bn/core/pywikibot/config2.py", line 285, in <module>
if arg.startswith("-verbose") or arg == "-v":
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 2: ordinal not in range(128)
Use python3 instead of python.
You are seeing that error because the module config2.py uses from __future__ import unicode_literals, making all string literals in the module unicode objects. However, sys.argv holds byte strings, and is not affected by __future__ imports.
Therefore, because arg is a byte string but "-verbose" and "-v" are unicode strings, arg gets implicitly promoted to unicode, and this fails because the implicit conversion uses the ASCII codec.
In Python 3, by contrast, all strings are unicode by default, including the entries of sys.argv.
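The failure mode can be imitated in Python 3 syntax by treating a command-line argument as raw bytes. This is only a sketch: the helper to_text is hypothetical, and the UTF-8 assumption about the terminal is mine, not something stated in the question.

```python
def to_text(arg, encoding="utf-8"):
    """Decode a byte-string argument explicitly instead of relying on ASCII."""
    if isinstance(arg, bytes):
        return arg.decode(encoding)
    return arg


# Bytes as a UTF-8 terminal would pass them: a backslash, '[', then the
# Bengali letter U+09AE, whose UTF-8 encoding starts with byte 0xe0.
raw_arg = b"\\[" + u"\u09ae".encode("utf-8")

# An ASCII decode (what Python 2 attempts implicitly) fails at position 2.
try:
    raw_arg.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# An explicit decode succeeds, and the comparison is then text against text.
text_arg = to_text(raw_arg)
is_verbose = text_arg.startswith(u"-verbose") or text_arg == u"-v"
```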

Another python unicode error

I'm getting errors such as
UnicodeEncodeError('ascii', u'\x01\xff \xfeJ a z z', 1, 2, 'ordinal not in range(128)'
I'm also getting sequences such as
u'\x17\x01\xff \xfeA r t B l a k e y'
I recognize \x01\xff\xfe as a BOM, but how do I transform these into the obvious output (Jazz and Art Blakey)?
These are coming from a program that reads music file tags.
I've tried various encodings, such as s.encode('utf8'), and various decodes followed by encodes, without success.
As requested:
from hsaudiotag import auto
inf = 'test.mp3'
song = auto.File(inf)
print song.album, song.artist, song.title, song.genre
Traceback (most recent call last):
File "audio2.py", line 4, in <module>
print song.album, song.artist, song.title, song.genre
File "C:\program files\python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xfe' in position 4 : character maps to <undefined>
If I change the print statement to
with open('x', 'wb') as f:
    f.write(song.genre)
I get
Traceback (most recent call last):
File "audio2.py", line 6, in <module>
f.write(song.genre)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 1:
ordinal not in range(128)
For your actual question, you need to write bytes, not characters, to files. Call:
f.write(song.genre.encode('utf-8'))
and you won't get the error. You can use io.open in text mode to get a character stream that does the encoding automatically (binary mode does not accept an encoding argument), i.e.:
import io
with io.open('x', 'w', encoding='utf-8') as f:
    f.write(song.genre)
Getting Unicode to the Console can be a matter of some difficulty (under Windows in particular)—see PrintFails.
However, as discussed in the comments, what you've got doesn't look like a working tag value. It looks more like a mangled ID3v2 frame data block, which it might not be possible to recover. I don't know if this is a bug in your tag reading library or you just have a file with rubbish tags.
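For the file-writing part of this answer, the two approaches can be compared side by side. This is a sketch in Python 3 syntax; the genre value and the temporary path are made up for the example and do not come from the question.

```python
import io
import os
import tempfile

genre = u"Jazz \u2013 Bebop"  # hypothetical tag value containing a non-ASCII dash

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x")

    # Approach 1: binary mode, so encode to bytes yourself before writing.
    with io.open(path, "wb") as f:
        f.write(genre.encode("utf-8"))
    with io.open(path, "rb") as f:
        raw_bytes = f.read()

    # Approach 2: text mode with an explicit encoding, so the stream encodes
    # for you. Note the mode is "w", not "wb".
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(genre)
    with io.open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()
```

Both approaches put the same bytes on disk; the second just moves the encode step into the file object.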

Python : UnicodeEncodeError: 'latin-1' codec can't encode character

I call an API and, based on its results, query the database for each record the API returns. The API returns strings, and when I make the database call for the returned items, I get the following error for some elements.
Traceback (most recent call last):
File "TopLevelCategories.py", line 267, in <module>
cursor.execute(categoryQuery, {'title': startCategory});
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/cursors.py", line 158, in execute
query = query % db.literal(args)
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 265, in literal
return self.escape(o, self.encoders)
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 203, in unicode_literal
return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)
The segment of my code the above error is referring is:
...
for startCategory in value[0]:
    categoryResults = []
    try:
        categoryRow = ""
        baseCategoryTree[startCategory] = []
        #print categoryQuery % {'title': startCategory};
        cursor.execute(categoryQuery, {'title': startCategory}) #unicode issue
        done = False
cont...
After some searching, I tried the following on my command line to understand what's going on:
>>> import sys
>>> u'\u2013'.encode('iso-8859-1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 0: ordinal not in range(256)
>>> u'\u2013'.encode('cp1252')
'\x96'
>>> '\u2013'.encode('cp1252')
'\\u2013'
>>> u'\u2013'.encode('cp1252')
'\x96'
But I am not sure how to overcome this issue. I also don't understand the theory behind encode('cp1252'), so an explanation of what I tried above would be welcome.
If you need Latin-1 encoding, you have several options to get rid of the en-dash or other code points above 255 (characters not included in Latin-1):
>>> u = u'hello\u2013world'
>>> u.encode('latin-1', 'replace') # replace it with a question mark
'hello?world'
>>> u.encode('latin-1', 'ignore') # ignore it
'helloworld'
Or do your own custom replacements:
>>> u.replace(u'\u2013', '-').encode('latin-1')
'hello-world'
If you aren't required to output Latin-1, then UTF-8 is a common and preferred choice. It is recommended by the W3C and nicely encodes all Unicode code points:
>>> u.encode('utf-8')
'hello\xe2\x80\x93world'
The unicode character u'\u2013' is the "en dash". It is contained in the Windows-1252 (cp1252) character set (encoded as the byte 0x96), but not in the Latin-1 (iso-8859-1) character set. Windows-1252 defines additional characters in the range 0x80 - 0x9f, among them the en dash.
The solution would be for you to choose a different target character set than Latin-1, such as Windows-1252 or UTF-8, or to replace the en dash with a simple "-".
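The difference between the three target character sets can be checked directly. The following is a sketch in Python 3 syntax; try_encode is a hypothetical helper used only to collect the results in one place.

```python
text = u"hello\u2013world"  # contains an en dash (U+2013)


def try_encode(s, codec):
    """Return the encoded bytes, or None if the codec cannot represent s."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError:
        return None


# Latin-1 has no en dash; cp1252 stores it as one byte; UTF-8 uses three.
results = {codec: try_encode(text, codec)
           for codec in ("latin-1", "cp1252", "utf-8")}

# Replacing the en dash with a plain hyphen makes the text Latin-1 safe.
latin1_safe = text.replace(u"\u2013", u"-").encode("latin-1")
```

Whichever codec is chosen, it has to match what the database connection and the table's character set expect.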
In Python 3, u.encode('utf-8') converts the string to bytes, which can then be written to stdout using sys.stdout.buffer.write(bytes).
See also sys.displayhook in the sys module documentation:
https://docs.python.org/3/library/sys.html

Problem with encode decode. Python. Django. BeautifulSoup

In this code:
soup=BeautifulSoup(program.Description.encode('utf-8'))
name=soup.find('div',{'class':'head'})
print name.string.decode('utf-8')
an error happens when I try to print or save to the database.
It doesn't matter what I do:
print name.string.encode('utf-8')
or just
print name.string
Traceback (most recent call last):
File "./manage.py", line 16, in <module>
execute_manager(settings)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/__init__.py", line 362, in execute_manager
utility.execute()
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/__init__.py", line 303, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/base.py", line 195, in run_from_argv
self.execute(*args, **options.__dict__)
File "/usr/local/cluster/dynamic/virtualenv/lib/python2.5/site-packages/django/core/management/base.py", line 222, in execute
output = self.handle(*args, **options)
File "/usr/local/cluster/dynamic/website/video/remmedia/management/commands/remmedia.py", line 50, in handle
self.FirstTimeLoad()
File "/usr/local/cluster/dynamic/website/video/remmedia/management/commands/remmedia.py", line 115, in FirstTimeLoad
print name.string.decode('utf-8')
File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-5: ordinal not in range(128)
This is repr(name.string)
u'\u0412\u044b\u043f\u0443\u0441\u043a \u043e\u0442 27 \u0434\u0435\u043a\u0430\u0431\u0440\u044f'
I don't know what you are trying to do with name.string.decode('utf-8'). As the BeautifulSoup documentation eloquently points out, "BeautifulSoup gives you Unicode, dammit". So name.string is already decoded - it is in unicode. You can encode it back to utf-8 if you want to, but you can't decode it any further.
You can try:
print name.string.encode('ascii', 'replace')
The output should be accepted whatever the encoding of sys.stdout is (including None).
In fact, the file-like object that you are printing to might not accept UTF-8. Here is an example: if you have the apparently benign program
# -*- coding: utf-8 -*-
print u"hérisson"
then running it in a terminal that can print accented characters works fine:
lebigot#weinberg /tmp % python2.5 test.py
hérisson
but printing to a standard output connected to a Unix pipe does not:
lebigot#weinberg /tmp % python2.5 test.py | cat
Traceback (most recent call last):
File "test.py", line 3, in <module>
print u"hérisson"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
because sys.stdout has encoding None, in this case: Python considers that the program that reads through the pipe should receive ASCII, and the printing fails because ASCII cannot represent the word that we want to print. A solution like the one above solves the problem.
Note: You can check the encoding of your standard output with:
print sys.stdout.encoding
This can help you debug encoding problems.
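The fallback described in this answer can be wrapped in a small function. This is a sketch in Python 3 syntax, and encode_for_stream is a hypothetical name; it treats a missing stream encoding the way the answer describes, as ASCII, and substitutes '?' for anything the codec cannot represent.

```python
def encode_for_stream(text, encoding):
    """Encode text for an output stream whose encoding may be None."""
    # A stream with no known encoding (e.g. a pipe under Python 2) is
    # treated as ASCII; 'replace' turns unencodable characters into '?'.
    return text.encode(encoding or "ascii", "replace")


word = u"h\xe9risson"

piped = encode_for_stream(word, None)        # encoding unknown, as with a pipe
terminal = encode_for_stream(word, "utf-8")  # a UTF-8 capable terminal
```

The replacement is lossy, so this suits diagnostics and logs better than data you need to read back.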
Edit: name.string comes from BeautifulSoup, so it is presumably already a unicode string.
However, your error message mentions 'ascii':
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-5:
ordinal not in range(128)
According to the PrintFails Python wiki page, if Python does not know or
can not determine what kind of encoding your output device is expecting, it sets
sys.stdout.encoding to None and print attempts to encode its arguments with
the 'ascii' codec.
I believe this is the cause of your problem. You can confirm this by seeing
if print sys.stdout.encoding prints None.
According to the same page, you can circumvent the problem by
explicitly telling Python what encoding to use. You do that by wrapping
sys.stdout in an instance of StreamWriter.
For example, you could try adding
import sys
import codecs
import locale
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)
to your script before the print statement. You may have to change
locale.getpreferredencoding() to an explicit encoding (e.g. 'utf-8',
'cp1252', etc.). The right encoding to use depends on your output device.
It should be set to whatever encoding your output device is expecting. If
you are outputting to a terminal, the terminal may have a menu setting to allow
the user to set what type of encoding the terminal should expect.
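What the StreamWriter wrapper does can be seen without touching the real sys.stdout. The sketch below is in Python 3 syntax and uses an in-memory io.BytesIO as a stand-in for a byte-oriented output device; the choice of 'utf-8' is an assumption for the example.

```python
import codecs
import io

# Stand-in for an output device that only accepts bytes (e.g. a pipe).
device = io.BytesIO()

# codecs.getwriter returns a StreamWriter class for the named codec;
# calling it wraps the byte stream so that text written to it is encoded.
writer = codecs.getwriter("utf-8")(device)
writer.write(u"h\xe9risson\n")

written = device.getvalue()
```

The wrapped object accepts unicode text and emits encoded bytes, which is the behaviour print needs when sys.stdout.encoding is None.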
Original answer: Try:
print name.string
or
print name.string.encode('utf-8')
try
text = text.decode("utf-8", "replace")
