I expected the following code to work, but it fails. What is the reason?
>>> s = 'ö'
>>> s.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 0: invalid start byte
In the interactive interpreter, the encoding of a string literal depends entirely on your terminal or console configuration. In your case, that is not set to UTF-8.
You can use the sys.stdin.encoding attribute to determine what codec to use:
>>> s = 'ö'
>>> import sys
>>> s.decode(sys.stdin.encoding)
u'\xf6'
Alternatively, just create a unicode string literal (using the u prefix) directly; the Python interactive interpreter knows to use the sys.stdin.encoding codec for that case:
>>> s = u'ö'
>>> s
u'\xf6'
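For contrast, here is the same distinction spelled out in Python 3 syntax (a sketch, not part of the original session), where the str/bytes split is explicit:

```python
# In Python 3 only bytes objects have .decode(), and the codec
# must match the bytes actually present.
raw = 'ö'.encode('utf-8')      # b'\xc3\xb6': two bytes in UTF-8
text = raw.decode('utf-8')     # back to the single character

assert raw == b'\xc3\xb6'
assert text == '\xf6'          # U+00F6, same as u'\xf6' above

# Decoding the same bytes with a wrong-but-permissive codec does not
# raise; it silently produces mojibake instead:
assert raw.decode('latin-1') == '\xc3\xb6'   # i.e. 'Ã¶'
```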
I installed pywikibot-core (version 2.0b3) for my MediaWiki installation. I got an error when I tried to run a command that contains Unicode text.
I ran the following command:
python pwb.py replace.py -regex -start:! "\[মুয়ায্যম হুসায়ন খান\]" "[মুয়ায্যম হুসায়ন খান]" -summary:"fix: মুয়ায্যম > মুয়ায্যম"
Here is the error I got:
Traceback (most recent call last):
File "pwb.py", line 161, in <module>
import pywikibot # noqa
File "/var/www/html/banglapedia_bn/core/pywikibot/__init__.py", line 32, in <module>
from pywikibot import config2 as config
File "/var/www/html/banglapedia_bn/core/pywikibot/config2.py", line 285, in <module>
if arg.startswith("-verbose") or arg == "-v":
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 2: ordinal not in range(128)
Use python3 instead of python.
You are seeing that error because the module config2.py uses from __future__ import unicode_literals, making all string literals in the module unicode objects. However, the contents of sys.argv are byte strings and are not affected by __future__ imports.
Therefore, because arg is a byte string while "-verbose" and "-v" are unicode strings, arg gets implicitly promoted to unicode; this fails because the implicit conversion uses the ASCII codec, and your argument contains non-ASCII bytes.
In Python 3, by contrast, all strings are unicode by default, including the contents of sys.argv.
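A minimal sketch (in Python 3 syntax, reproducing the Python 2 mechanism explicitly) of why the comparison blew up: sys.argv holds UTF-8 bytes, and the implicit promotion Python 2 attempts is an ASCII decode:

```python
# In Python 2, a Bengali command-line argument arrives in sys.argv as
# UTF-8 bytes (assuming a UTF-8 terminal); reproduced here explicitly:
arg = 'মুয়ায্যম'.encode('utf-8')
assert arg[0] == 0xe0          # the byte named in the traceback

# Comparing it against a unicode literal makes Python 2 attempt an
# implicit ASCII decode, which is in effect this:
failed = False
try:
    arg.decode('ascii')
except UnicodeDecodeError as e:
    failed = True
    assert e.encoding == 'ascii'   # same codec named in the traceback
assert failed
```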
I have some Japanese characters in my nova.conf file.
クラ
After reading it from the config file, I have to decode it as UTF-8, like this:
my_data = CONF.test.test
my_data = my_data.decode('utf-8')
When I use the variable without decoding it, I get a UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
Is there any way to read data from config directly in decoded form?
Not in Python 2.7, because in Python prior to 3, plain strings are byte strings (implicitly decoded as ASCII), whereas in Python 3 strings default to unicode strings. So basically:
>>> mydata = "クラ"
>>> print mydata.decode('utf-8')
クラ
>>> print mydata
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
whereas in python 3:
>>> mydata = "クラ"
>>> print(mydata)
クラ
So if you want to handle unicode strings painlessly, it's time to make the switch.
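If switching isn't immediately possible, a common pattern is to decode once at the boundary where the bytes enter your program. A sketch in Python 3 syntax (`to_text` is an illustrative helper, not part of oslo.config or any library):

```python
# A value read by a Python 2 config parser arrives as UTF-8 bytes,
# simulated here explicitly:
raw = 'クラ'.encode('utf-8')

def to_text(value, encoding='utf-8'):
    # Decode bytes once at the boundary; pass text through unchanged.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

assert to_text(raw) == 'クラ'
assert to_text('クラ') == 'クラ'   # already text: returned as-is
```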
>>> unicode('восстановление информации', 'utf-16')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xb8 in position 48: truncated data
>>> unicode('восстановление информации', 'utf-8')
u'\u0432\u043e\u0441\u0441\u0442\u0430\u043d\u043e\u0432\u043b\u0435\u043d\u0438\u0435 \u0438\u043d\u0444\u043e\u0440\u043c\u0430\u0446\u0438\u0438'
Why do these Russian words encode in UTF-8 fine, but not UTF-16?
You are asking the unicode function to decode a byte string and then giving it the wrong encoding.
Pasting your string into Python-2.7 on OS-X gives
>>> 'восстановление информации'
'\xd0\xb2\xd0\xbe\xd1\x81\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb8\xd0\xbd\xd1\x84\xd0\xbe\xd1\x80\xd0\xbc\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8'
At this stage it is already a UTF-8 encoded byte string (your terminal's encoding most likely determined this), so you can decode it by specifying the utf-8 codec:
>>> 'восстановление информации'.decode('utf-8')
u'\u0432\u043e\u0441\u0441\u0442\u0430\u043d\u043e\u0432\u043b\u0435\u043d\u0438\u0435 \u0438\u043d\u0444\u043e\u0440\u043c\u0430\u0446\u0438\u0438'
But not as UTF-16, since those bytes are not valid UTF-16:
>>> 'восстановление информации'.decode('utf-16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xb8 in position 48: truncated data
If you want to encode a unicode string to UTF-8 or UTF-16, then use
>>> u'восстановление информации'.encode('utf-16')
'\xff\xfe2\x04>\x04A\x04A\x04B\x040\x04=\x04>\x042\x04;\x045\x04=\x048\x045\x04 \x008\x04=\x04D\x04>\x04#\x04<\x040\x04F\x048\x048\x04'
>>> u'восстановление информации'.encode('utf-8')
'\xd0\xb2\xd0\xbe\xd1\x81\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb8\xd0\xbd\xd1\x84\xd0\xbe\xd1\x80\xd0\xbc\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8'
Notice the input strings are unicode (have a u at the front), but the outputs here are byte-strings (they don't have u at the start) which contain the unicode data encoded in the respective formats.
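The general rule the answer describes, decode bytes to text and then encode text to the target bytes, can be sketched as (Python 3 syntax):

```python
# Round-tripping between encodings always goes through the text type:
# decode bytes -> text, then encode text -> bytes in the new encoding.
utf8_bytes = 'восстановление информации'.encode('utf-8')

text = utf8_bytes.decode('utf-8')       # bytes -> text
utf16_bytes = text.encode('utf-16')     # text -> bytes, other encoding
assert utf16_bytes.decode('utf-16') == text

# Decoding UTF-8 bytes *as* UTF-16 is what raised the error above;
# this 49-byte string cannot even be split into 2-byte UTF-16 units:
failed = False
try:
    utf8_bytes.decode('utf-16')
except UnicodeDecodeError:
    failed = True
assert failed
```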
I have a scenario where I call an API and, based on its results, query the database for each record the API returns. The API call returns strings, and when I make the database call for the items it returned, I get the following error for some elements:
Traceback (most recent call last):
File "TopLevelCategories.py", line 267, in <module>
cursor.execute(categoryQuery, {'title': startCategory});
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/cursors.py", line 158, in execute
query = query % db.literal(args)
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 265, in literal
return self.escape(o, self.encoders)
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 203, in unicode_literal
return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)
The segment of my code the above error refers to is:
...
for startCategory in value[0]:
categoryResults = []
try:
categoryRow = ""
baseCategoryTree[startCategory] = []
#print categoryQuery % {'title': startCategory};
cursor.execute(categoryQuery, {'title': startCategory}) #unicode issue
done = False
cont...
After doing some Google searching, I tried the following on the command line to understand what's going on:
>>> import sys
>>> u'\u2013'.encode('iso-8859-1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 0: ordinal not in range(256)
>>> u'\u2013'.encode('cp1252')
'\x96'
>>> '\u2013'.encode('cp1252')
'\\u2013'
>>> u'\u2013'.encode('cp1252')
'\x96'
But I am not sure what the solution to this issue would be. I also don't know the theory behind encode('cp1252'); it would be great to get some explanation of what I tried above.
If you need Latin-1 encoding, you have several options to get rid of the en-dash or other code points above 255 (characters not included in Latin-1):
>>> u = u'hello\u2013world'
>>> u.encode('latin-1', 'replace') # replace it with a question mark
'hello?world'
>>> u.encode('latin-1', 'ignore') # ignore it
'helloworld'
Or do your own custom replacements:
>>> u.replace(u'\u2013', '-').encode('latin-1')
'hello-world'
If you aren't required to output Latin-1, then UTF-8 is a common and preferred choice. It is recommended by the W3C and nicely encodes all Unicode code points:
>>> u.encode('utf-8')
'hello\xe2\x80\x93world'
The unicode character u'\u2013' is the "en dash". It is contained in the Windows-1252 (cp1252) character set (encoded as the byte 0x96), but not in the Latin-1 (iso-8859-1) character set. Windows-1252 defines some additional characters in the range 0x80-0x9f, among them the en dash.
The solution would be for you to choose a different target character set than Latin-1, such as Windows-1252 or UTF-8, or to replace the en dash with a simple "-".
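Besides 'replace' and 'ignore', a third error handler worth knowing is 'xmlcharrefreplace', which preserves the character as an XML character reference; useful when the Latin-1 output ends up in HTML. A sketch in Python 3 syntax:

```python
u = 'hello\u2013world'

# 'xmlcharrefreplace' keeps the information instead of discarding it
# (U+2013 is decimal 8211):
assert u.encode('latin-1', 'xmlcharrefreplace') == b'hello&#8211;world'

# cp1252, unlike latin-1, does contain the en dash, at byte 0x96:
assert u.encode('cp1252') == b'hello\x96world'
```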
u.encode('utf-8') converts the string to bytes, which can then be written to stdout in Python 3 using sys.stdout.buffer.write(bytes).
See also sys.displayhook in the documentation:
https://docs.python.org/3/library/sys.html
Is this a bug?
>>> import json
>>> import cPickle
>>> json.dumps(cPickle.dumps(u'å'))
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/json/__init__.py", line 230, in dumps
return _default_encoder.encode(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/json/encoder.py", line 361, in encode
return encode_basestring_ascii(o)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3: invalid data
The json module is expecting strings to encode text. Pickled data isn't text, it's 8-bit binary.
One simple workaround, if you really need to send pickled data over JSON, is to use base64:
j = json.dumps(base64.b64encode(cPickle.dumps(u'å')))
cPickle.loads(base64.b64decode(json.loads(j)))
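For completeness, the same round trip with the imports spelled out, in Python 3 syntax (where pickle replaces cPickle and the base64 output must be decoded to text before json.dumps will accept it):

```python
import base64
import json
import pickle

# pickle bytes -> base64 ASCII text -> JSON string, and back again.
payload = pickle.dumps('å')                    # arbitrary binary data
wire = json.dumps(base64.b64encode(payload).decode('ascii'))

restored = pickle.loads(base64.b64decode(json.loads(wire)))
assert restored == 'å'
```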
Note that this is very clearly a Python bug. Protocol version 0 is explicitly documented as ASCII, yet å is sent as the non-ASCII byte \xe5 instead of being encoded as "\u00E5". This bug was reported upstream, and the ticket was closed without the bug being fixed: http://bugs.python.org/issue2980
Could be a bug in pickle. My Python documentation says (for the pickle format used): Protocol version 0 is the original ASCII protocol and is backwards compatible with earlier versions of Python. [...] If a protocol is not specified, protocol 0 is used.
>>> cPickle.dumps(u'å').decode('ascii')
Traceback (most recent call last):
File "", line 1, in
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 1: ordinal not in range(128)
That ain't ASCII.
And, I don't know whether it's relevant, or even a problem:
>>> cPickle.dumps(u'å') == pickle.dumps(u'å')
False
I'm using Python 2.6 and your code runs without any error.
In [1]: import json
In [2]: import cPickle
In [3]: json.dumps(cPickle.dumps(u'å'))
Out[3]: '"V\\u00e5\\np1\\n."'
BTW, what's your system default encoding? In my case, it's:
In [6]: sys.getdefaultencoding()
Out[6]: 'ascii'
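For comparison, Python 3 changed this default, which is a large part of why this class of implicit-decode error went away there:

```python
import sys

# Python 3 reports UTF-8 rather than ASCII as the default encoding,
# and never implicitly decodes bytes with it anyway.
assert sys.getdefaultencoding() == 'utf-8'
```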