This Python script gets translit for Russian letters:
s = u'Код Обмена Информацией, 8 бит'.encode('koi8-r')
print ''.join([chr(ord(c) & 0x7F) for c in s]) # kOD oBMENA iNFORMACIEJ, 8 BIT
That works. But I want to modify it so as to get user input. Now I'm stuck at this:
s = raw_input("Enter a string you want to translit: ")
s = unicode(s)
s = s.encode('koi8-r')
print ''.join([chr(ord(c) & 0x7F) for c in s])
Ending up with this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
What's wrong?
s = unicode(s) expects ascii encoding by default. You need to supply it an encoding your input is in, e.g. s = unicode(s, 'utf-8').
try unicode(s, encoding) where encoding is whatever your terminal is in.
Looking at the error messages that you are seeing, it seems to me that your terminal encoding is probably set to KOI8-R, in which case you don't need to perform any decoding on the input data. If this is the case then all you need is:
>>> s = raw_input("Enter a string you want to translit: ")
>>> print ''.join([chr(ord(c) & 0x7F) for c in s])
kOD oBMENA iNFORMACIEJ, 8 BIT
You can double check this by s.decode('koi8-r') which should succeed and return the equivalent unicode string.
Related
I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.
I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.
cp1252_to_unicode = {
"\x85": u'\u2026', # …
"\x91": u'\u2018', # ‘
"\x92": u'\u2019', # ’
"\x93": u'\u201c', # “
"\x94": u'\u201d', # ”
"\x97": u'\u2014' # —
}
for l in open('file.txt'):
for c, u in cp1252_to_unicode.items():
l = l.replace(c, u)
But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:
"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Any ideas for how to deal with this?
If you try to decode this string as utf-8, as you already know, you will get an "UnicodeDecode" error, as these spurious cp1252 characters are invalid utf-8 -
However, Python codecs allow you to register a callback to handle encoding/decoding errors, with the codecs.register_error function - it gets the UnicodeDecodeerror a a parameter - you can write such a handler that atempts to decode the data as "cp1252", and continues the decoding in utf-8 for the rest of the string.
In my utf-8 terminal, I can build a mixed incorrect string like this:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma��
>>> a.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data
I wrote the said callback function here, and found a catch: even if you increment the position from which to decode the string by 1, so that it would start on the next chratcer, if the next character is also not utf-8 and out of range(128), the error is raised at the first out of range(128) character - that means, the decoding "walks back" if consecutive non-ascii, non-utf-8 chars are found.
The worka round this is to have a state variable in the error_handler which detects this "walking back" and resume decoding from the last call to it - on this short example, I implemented it as a global variable - (it will have to be manually reset to "-1" before each call to the decoder):
import codecs
last_position = -1
def mixed_decoder(unicode_error):
global last_position
string = unicode_error[1]
position = unicode_error.start
if position <= last_position:
position = last_position + 1
last_position = position
new_char = string[position].decode("cp1252")
#new_char = u"_"
return new_char, position + 1
codecs.register_error("mixed", mixed_decoder)
And on the console:
>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã
With thanks to jsbueno and a whack of other Google searches and other pounding I solved it this way.
#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")
This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.
import codecs
replacement = {
'85' : '...', # u'\u2026' ... character.
'96' : '-', # u'\u2013' en-dash
'97' : '-', # u'\u2014' em-dash
'91' : "'", # u'\u2018' left single quote
'92' : "'", # u'\u2019' right single quote
'93' : '"', # u'\u201C' left double quote
'94' : '"', # u'\u201D' right double quote
'95' : "*" # u'\u2022' bullet
}
#This is is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
errStr = unicodeError[1]
errLen = unicodeError.end - unicodeError.start
nextPosition = unicodeError.start + errLen
errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')
if errHex in replacement:
return u'%s' % replacement[errHex], nextPosition
return u'%s' % errHex, nextPosition # Comment this line out to get a question mark
return u'?', nextPosition
codecs.register_error("mixed", mixed_decoder)
xmlText = xmlText.decode("utf-8", "mixed")
Basically I attempt to turn it into utf8. For any characters that fail I just convert it to HEX so I can display or look it up in a table of my own.
This is not pretty but it does allow me to make sense of messed up data
Good solution that of #jsbueno, but there is no need of global variable last_position, see:
def mixed_decoder(error: UnicodeError) -> (str, int):
bs: bytes = error.object[error.start: error.end]
return bs.decode("cp1252"), error.start + 1
import codecs
codecs.register_error("mixed", mixed_decoder)
a = "maçã".encode("utf-8") + "maçã".encode("cp1252")
# a = b"ma\xc3\xa7\xc3\xa3ma\xe7\xe3"
s = a.decode("utf-8", "mixed")
# s = "maçãmaçã"
This is usually called Mojibake.
There's a nice Python library that might solve these issues for you called ftfy.
Example:
>>> from ftfy import fix_text
>>> fix_text("Ð¨ÐµÐ¿Ð¾Ñ (напоминалки)")
'Шепот (напоминалки)'
Just came into this today, so here is my problem and my own solution:
original_string = 'Notifica\xe7\xe3o de Emiss\xe3o de Nota Fiscal Eletr\xf4nica.'
def mixed_decoding(s):
output = ''
ii = 0
for c in s:
if ii <= len(s)-1:
if s[ii] == '\\' and s[ii+1] == 'x':
b = s[ii:ii+4].encode('ascii').decode('unicode-escape')
output = output+b
ii += 3
else:
output = output+s[ii]
ii += 1
print(output)
return output
decoded_string = mixed_decoding(original_string)
Now it prints:
>>> Notificação de Emissão de Nota Fiscal Eletrônica.
I want to print '\xd6\xd0\xb9\xfa\xba\xda\xc1\xfa\xbd\xad' which is a Chinese character.
l = ['\xd6\xd0\xb9\xfa\xba\xda\xc1\xfa\xbd\xad']
a = [l[0].decode('utf-8')]
print(a[0])
But it raises this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xd6 in position 0: invalid continuation byte. I also tried deocde('latin-1'). But the result aren't Chinese characters.
Try with:
l = ['\xd6\xd0\xb9\xfa\xba\xda\xc1\xfa\xbd\xad']
a = [l[0].decode('gb2312').encode('utf-8')]
print(a[0])
output:
中国黑龙江
Update: as Mark's advice, use l[0].decode('gb2312') will be sufficient.
l = ['\xd6\xd0\xb9\xfa\xba\xda\xc1\xfa\xbd\xad']
a = [l[0].decode('gb2312')]
print(a[0])
I'm trying to write a script that generates random unicode by creating random utf-8 encoded strings and then decoding those to unicode. It works fine for a single byte, but with two bytes it fails.
For instance, if I run the following in a python shell:
>>> a = str()
>>> a += chr(0xc0) + chr(0xaf)
>>> print a.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte
According to the utf-8 scheme https://en.wikipedia.org/wiki/UTF-8#Description the byte sequence 0xc0 0xaf should be valid as 0xc0 starts with 110 and 0xaf starts with 10.
Here's my python script:
def unicode(self):
'''returns a random (astral) utf encoded byte string'''
num_bytes = random.randint(1,4)
if num_bytes == 1:
return self.gen_utf8(num_bytes, 0x00, 0x7F)
elif num_bytes == 2:
return self.gen_utf8(num_bytes, 0xC0, 0xDF)
elif num_bytes == 3:
return self.gen_utf8(num_bytes, 0xE0, 0xEF)
elif num_bytes == 4:
return self.gen_utf8(num_bytes, 0xF0, 0xF7)
def gen_utf8(self, num_bytes, start_val, end_val):
byte_str = list()
byte_str.append(random.randrange(start_val, end_val)) # start byte
for i in range(0,num_bytes-1):
byte_str.append(random.randrange(0x80,0xBF)) # trailing bytes
a = str()
sum = int()
for b in byte_str:
a += chr(b)
ret = a.decode('utf-8')
return ret
if __name__ == "__main__":
g = GenFuzz()
print g.gen_utf8(2,0xC0,0xDF)
This is, indeed, invalid UTF-8. In UTF-8, only code points in the range U+0080 to U+07FF, inclusive, can be encoded using two bytes. Read the Wikipedia article more closely, and you will see the same thing. As a result, the byte 0xc0 may not appear in UTF-8, ever. The same is true of 0xc1.
Some UTF-8 decoders have erroneously decoded sequences like C0 AF as valid UTF-8, which has lead to security vulnerabilities in the past.
Found one standard that actually accepts 0xc0 : encoding="ISO-8859-1"
from https://stackoverflow.com/a/27456542/4355695
But this entails making sure the rest of the file doesn't have unicode chars, so this would not be an exact answer to the question, but may be helpful for folks like me who didn't have any unicode chars in their file anyways and just wanted python to load the damn thing and both utf-8 and ascii encodings were erroring out.
More on ISO-8859-1 : What is the difference between UTF-8 and ISO-8859-1?
I got a list = [0x97, 0x52], not unicode object. this is unicode of a charactor '青'(u'\u9752'). How could I change this list to unicode object first, then encode to 'UTF-8'?
bytes = [0x97, 0x52]
code = bytes[0] * 256 + bytes[1] # build the 16-bit code
char = unichr(code) # convert code to unicode
utf8 = char.encode('utf-8') # encode unicode as utf-8
print utf8 # prints '青'
Not sure if this is the most elegant way, but it works for this particular example.
>>> ''.join([chr(x) for x in [0x97, 0x52]]).decode('utf-16be')
u'\u9752'
I'm falling the unicode hell.
My environment in on unix, python 2.7.3
LC_CTYPE=zh_TW.UTF-8
LANG=en_US.UTF-8
I'm trying to dump hex encoded data in human readable format, here is simplified code
#! /usr/bin/env python
# encoding:utf-8
import sys
s=u"readable\n" # previous result keep in unicode string
s2="fb is not \xfb" # data read from binary file
s += s2
print s # method 1
print s.encode('utf-8') # method 2
print s.encode('utf-8','ignore') # method 3
print s.decode('iso8859-1') # method 4
# method 1-4 display following error message
#UnicodeDecodeError: 'ascii' codec can't decode byte 0xfb
# in position 0: ordinal not in range(128)
f = open('out.txt','wb')
f.write(s)
I just want to print out the 0xfb.
I should describe more here. The key is 's += s2'.
Where s will keep my previous decoded string.
And the s2 is next string which should append into s.
If I modified as following, it occurs on write file.
s=u"readable\n"
s2="fb is not \xfb"
s += s2.decode('cp437')
print s
f=open('out.txt','wb')
f.write(s)
# UnicodeEncodeError: 'ascii' codec can't encode character
# u'\u221a' in position 1: ordinal not in range(128)
I wish the result of out.txt is
readable
fb is not \xfb
or
readable
fb is not 0xfb
[Solution]
#! /usr/bin/env python
# encoding:utf-8
import sys
import binascii
def fmtstr(s):
r = ''
for c in s:
if ord(c) > 128:
r = ''.join([r, "\\x"+binascii.hexlify(c)])
else:
r = ''.join([r, c])
return r
s=u"readable"
s2="fb is not \xfb"
s += fmtstr(s2)
print s
f=open('out.txt','wb')
f.write(s)
I strongly suspect that your code is actually erroring out on the previous line: the s += s2 one. s2 is just a series of bytes, which can't be arbitrarily tacked on to a unicode object (which is instead a series of code points).
If you had intended the '\xfb' to represent U+FB, LATIN SMALL LETTER U WITH CIRCUMFLEX, it would have been better to assign it like this instead:
s2 = u"\u00fb"
But you said that you just want to print out \xHH codes for control characters. If you just want it to be something humans can understand which still makes it apparent that special characters are in a string, then repr may be enough. First, don't have s be a unicode object, because you're treating your strings here as a series of bytes, not a series of code points.
s = s.encode('utf-8')
s += s2
print repr(s)
Finally, if you don't want the extra quotes on the outside that repr adds, for nice pretty printing or whatever, there's not a simple builtin way to do that in Python (that I know of). I've used something like this before:
import re
controlchars_re = re.compile(r'[\x00-\x31\x7f-\xff]')
def _show_control_chars(match):
txt = repr(match.group(0))
return txt[1:-1]
def escape_special_characters(s):
return controlchars_re.sub(_show_control_chars, s.replace('\\', '\\\\'))
You can pretty easily tweak the controlchars_re regex to define which characters you care about escaping.