I have some Japanese characters in my nova.conf file:
クラ
After reading from the config file I have to decode it as UTF-8, like:
my_data = CONF.test.test
my_data = my_data.decode('utf-8')
When I use the variable without decoding it, I get a UnicodeDecodeError:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
Is there any way to read data from config directly in decoded form?
Not in Python 2.7, because in Python versions prior to 3, string literals are byte strings (implicitly decoded as ASCII when mixed with Unicode), whereas in Python 3 string literals are Unicode. So basically:
>>> mydata = "クラ"
>>> print mydata.decode('utf-8')
クラ
>>> unicode(mydata)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
whereas in python 3:
>>> mydata = "クラ"
>>> print(mydata)
クラ
So if you want to handle Unicode strings painlessly, it's time to make the switch.
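For comparison, here is a minimal Python 3 sketch of the same round trip (the byte string is a stand-in for the raw value read from nova.conf, not the actual oslo.config API):

```python
# Python 3: bytes and str are distinct types; decoding is always explicit.
raw = "クラ".encode("utf-8")   # stand-in for raw bytes read from the config file
print(raw)                     # the UTF-8 bytes: b'\xe3\x82\xaf\xe3\x83\xa9'
text = raw.decode("utf-8")     # explicit decode, as in the question
print(text)                    # クラ
```

There is no implicit ASCII decode anywhere, so the error from the question cannot occur.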
Related
I am using Python 3, and I found this error in the server log while converting bytes to a string:
b'\x00\x01_\x97'.decode()
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
b'\x00\x01_\x97'.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 3: invalid start byte
How can I convert this byte string to a string?
You need to specify an encoding in which every byte value is valid, such as Latin-1:
>>> b'\x00\x01_\x97'.decode("Latin")
'\x00\x01_\x97'
>>> type(b'\x00\x01_\x97'.decode("Latin"))
<class 'str'>
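Latin-1 works here because it maps each of the 256 possible byte values to the code point with the same number, so decoding can never fail and the round trip is lossless. A quick sketch:

```python
data = b'\x00\x01_\x97'
s = data.decode('latin-1')          # never raises: every byte value is valid
assert s.encode('latin-1') == data  # lossless round trip
print(repr(s))
```

Note that 0x97 is a C1 control character in Latin-1. If the bytes were actually text in some other encoding, decode with that codec instead; in cp1252, for example, 0x97 is an em dash.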
I expected the following code to work, but it fails. What is the reason?
>>> s = 'ö'
>>> s.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 0: invalid start byte
In the interactive interpreter, the encoding of a string literal depends entirely on your terminal or console configuration. In your case, that is not set to UTF-8.
You can use the sys.stdin.encoding attribute to determine what codec to use:
>>> s = 'ö'
>>> import sys
>>> s.decode(sys.stdin.encoding)
u'\xf6'
Alternatively, just create a unicode string literal (using the u prefix) directly; the Python interactive interpreter knows to use the sys.stdin.encoding codec for that case:
>>> s = u'ö'
>>> s
u'\xf6'
I have this function to remove accents from a word:
def remove_accents(word):
    return ''.join(x for x in unicodedata.normalize('NFKD', word) if x in string.ascii_letters)
But when I run it, it shows an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 3: ordinal not in range(128)
The character in position 3 is: ó
If your input is a unicode string, it works:
>>> remove_accents(u"foóbar")
u'foobar'
If it isn't, it doesn't. I don't get the error you describe, though; I get a TypeError instead, and only get the UnicodeDecodeError if I try to convert the input to unicode by doing
>>> remove_accents(unicode("foóbar"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
If that is your problem, i.e. you have Python 2 str objects as input, you can solve it by decoding them as UTF-8 first:
>>> remove_accents("foóbar".decode("utf-8"))
u'foobar'
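A self-contained Python 3 version of the function, where all strings are Unicode and no decode call is needed:

```python
import string
import unicodedata

def remove_accents(word):
    # NFKD decomposes 'ó' into 'o' plus a combining acute accent;
    # keeping only ASCII letters then drops the combining marks.
    return ''.join(x for x in unicodedata.normalize('NFKD', word)
                   if x in string.ascii_letters)

print(remove_accents("foóbar"))  # foobar
```

Note this also silently drops spaces, digits, and punctuation, since only ASCII letters survive the filter.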
>>> unicode('восстановление информации', 'utf-16')
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xb8 in position 48: truncated data
>>> unicode('восстановление информации', 'utf-8')
u'\u0432\u043e\u0441\u0441\u0442\u0430\u043d\u043e\u0432\u043b\u0435\u043d\u0438\u0435 \u0438\u043d\u0444\u043e\u0440\u043c\u0430\u0446\u0438\u0438'
Why do these Russian words encode in UTF-8 fine, but not UTF-16?
You are asking the unicode function to decode a byte string and then giving it the wrong encoding.
Pasting your string into Python-2.7 on OS-X gives
>>> 'восстановление информации'
'\xd0\xb2\xd0\xbe\xd1\x81\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb8\xd0\xbd\xd1\x84\xd0\xbe\xd1\x80\xd0\xbc\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8'
At this stage it is already a UTF-8 encoded string (probably your terminal determined this), so you can decode it by specifying the utf-8 codec
>>> 'восстановление информации'.decode('utf-8')
u'\u0432\u043e\u0441\u0441\u0442\u0430\u043d\u043e\u0432\u043b\u0435\u043d\u0438\u0435 \u0438\u043d\u0444\u043e\u0440\u043c\u0430\u0446\u0438\u0438'
But not as UTF-16, since the bytes are not valid UTF-16:
>>> 'восстановление информации'.decode('utf-16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0xb8 in position 48: truncated data
If you want to encode a unicode string to UTF-8 or UTF-16, then use
>>> u'восстановление информации'.encode('utf-16')
'\xff\xfe2\x04>\x04A\x04A\x04B\x040\x04=\x04>\x042\x04;\x045\x04=\x048\x045\x04 \x008\x04=\x04D\x04>\x04@\x04<\x040\x04F\x048\x048\x04'
>>> u'восстановление информации'.encode('utf-8')
'\xd0\xb2\xd0\xbe\xd1\x81\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5 \xd0\xb8\xd0\xbd\xd1\x84\xd0\xbe\xd1\x80\xd0\xbc\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8'
Notice the input strings are unicode (they have a u prefix), but the outputs here are byte strings (no u prefix) which contain the Unicode data encoded in the respective formats.
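The same distinction in Python 3, where the encode/decode pairing is explicit (a sketch):

```python
text = 'восстановление информации'
utf8 = text.encode('utf-8')     # 2 bytes per Cyrillic letter plus 1 for the space
utf16 = text.encode('utf-16')   # BOM plus 2 bytes per character
assert utf8.decode('utf-8') == text
assert utf16.decode('utf-16') == text

# Decoding the UTF-8 bytes as UTF-16 fails: the byte count is odd,
# so the final 16-bit code unit is truncated.
try:
    utf8.decode('utf-16')
except UnicodeDecodeError as e:
    print(e)
```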
I have a scenario where I call an API and, for each record it returns, I query the database. The API returns strings, and when I make the database call for some of those items, I get the following error:
Traceback (most recent call last):
File "TopLevelCategories.py", line 267, in <module>
cursor.execute(categoryQuery, {'title': startCategory});
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/cursors.py", line 158, in execute
query = query % db.literal(args)
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 265, in literal
return self.escape(o, self.encoders)
File "/opt/ts/python/2.7/lib/python2.7/site-packages/MySQLdb/connections.py", line 203, in unicode_literal
return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)
The segment of my code the above error refers to is:
...
for startCategory in value[0]:
    categoryResults = []
    try:
        categoryRow = ""
        baseCategoryTree[startCategory] = []
        #print categoryQuery % {'title': startCategory};
        cursor.execute(categoryQuery, {'title': startCategory})  # unicode issue
        done = False
        ...
After some Google searching, I tried the following on the command line to understand what's going on:
>>> import sys
>>> u'\u2013'.encode('iso-8859-1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 0: ordinal not in range(256)
>>> u'\u2013'.encode('cp1252')
'\x96'
>>> '\u2013'.encode('cp1252')
'\\u2013'
>>> u'\u2013'.encode('cp1252')
'\x96'
But I am not sure what the solution to this issue would be. I also don't understand the theory behind encode('cp1252'); it would be great to get an explanation of what I tried above.
If you need Latin-1 encoding, you have several options to get rid of the en-dash or other code points above 255 (characters not included in Latin-1):
>>> u = u'hello\u2013world'
>>> u.encode('latin-1', 'replace') # replace it with a question mark
'hello?world'
>>> u.encode('latin-1', 'ignore') # ignore it
'helloworld'
Or do your own custom replacements:
>>> u.replace(u'\u2013', '-').encode('latin-1')
'hello-world'
If you aren't required to output Latin-1, then UTF-8 is a common and preferred choice. It is recommended by the W3C and nicely encodes all Unicode code points:
>>> u.encode('utf-8')
'hello\xe2\x80\x93world'
The Unicode character u'\u2013' is the en dash. It is contained in the Windows-1252 (cp1252) character set (encoded as the byte 0x96), but not in the Latin-1 (iso-8859-1) character set. Windows-1252 defines some additional characters in the range 0x80-0x9f, among them the en dash.
The solution would be for you to choose a different target character set than Latin-1, such as Windows-1252 or UTF-8, or to replace the en dash with a simple "-".
u.encode('utf-8') converts it to bytes, which can then be written to stdout using sys.stdout.buffer.write(bytes).
See also sys.displayhook at https://docs.python.org/3/library/sys.html
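A minimal Python 3 sketch of that bytes path (the sample text is arbitrary):

```python
import sys

text = 'восстановление'                # a str (Unicode) in Python 3
data = text.encode('utf-8')            # encode to bytes explicitly
sys.stdout.buffer.write(data + b'\n')  # bypass the text-layer encoder entirely
```

This is occasionally useful when sys.stdout's encoding cannot represent the characters you need to print.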