Sorry for asking such a foolish question, but maybe someone can help me get out of the decode/encode hell of Python 2.7.
I have a string as below. I'm not sure, but I think it's encoded as UTF-8, because I wrote # -*- coding: utf-8 -*- at the head of the .py file:
s = "今日もしないとね"
and from my point of view, if it's a string, a part of it should be printable using [] like this:
print s[1]
Then I got an error in my Sublime:
[Decode error - output not utf-8]
When I tried in a terminal, I got a
?
Okay, maybe a part of a UTF-8 string is no longer valid UTF-8 on its own, so I tried:
print s[1].encode("utf-8")
then I got this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position 0: ordinal not in range(128)
I was totally confused. Does it mean that a part of a string is ASCII, like \xbb?
Could anybody tell me what the encodings of the following are?
a = "今日もしないとね"
b = u"今日もしないとね"
c = "python2.7 fxxked me"
d = u"python2.7 fxxked me"
e = "今"
f = "z"
aa = a[0]
bb = b[0]
cc = c[0]
dd = d[0]
And how do I get "今日" from "今日もしないとね"?
Thank you!
Your file is correctly encoded in UTF-8 but your operating system doesn't (directly) support Unicode on output.
Your s is a byte string, so s[1] gives you a single byte — here 0xbb, the second byte of the three-byte UTF-8 sequence for 今 — which is not valid UTF-8 on its own; that's why your editor and terminal print garbage, and why the error message mentions byte 0xbb. The right way to specify a Unicode string literal in Python 2 is with the u prefix. Only then does the variable actually hold a Unicode string, and indexing gives you whole characters.
By the way, you can see what Python actually thinks your variables contain by using the repr function:
>>> print repr(a)
'\xe4\xbb\x8a\xe6\x97\xa5\xe3\x82\x82\xe3\x81\x97\xe3\x81\xaa\xe3\x81\x84\xe3\x81\xa8\xe3\x81\xad'
>>> print repr(b)
u'\u4eca\u65e5\u3082\u3057\u306a\u3044\u3068\u306d'
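For example, to get real characters out of a, decode it first; after that, indexing gives whole characters instead of single bytes. A minimal sketch, assuming the file and the terminal are both UTF-8:
# -*- coding: utf-8 -*-
a = "今日もしないとね"        # byte string: 24 UTF-8 bytes
ua = a.decode('utf-8')       # unicode string: 8 characters
print len(a), len(ua)        # 24 8
print ua[0].encode('utf-8')  # 今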
As the comments suggest, Unicode isn't as easy to pick up by trial and error as many other parts of Python.
The following code sample will print "今日"
# -*- coding: utf-8 -*-
b = u"今日もしないとね"
print b[:2]
However, the coding line only tells Python how to interpret the bytes in the file. Many editors won't look at the coding line, so you'll also need to make sure your editor is in fact using UTF-8 when it works out how to display those bytes to you.
When Python gets to the print statement, it will take the unicode object b and encode it using sys.stdout.encoding. Now this better also match your terminal/console settings, or you will get some garbage printed instead.
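If sys.stdout.encoding is missing or wrong (for example when output is piped to another program), you can sidestep the guesswork by encoding explicitly. A minimal sketch, assuming the terminal expects UTF-8:
# -*- coding: utf-8 -*-
import sys

b = u"今日もしないとね"
print sys.stdout.encoding      # e.g. UTF-8, or None when piped
print b[:2].encode('utf-8')    # 今日, no matter what stdout guessed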
How can I correctly print Vietnamese CP1258-encoded characters in Python 3? My terminal doesn't seem to be the issue, as it prints the first print statement in my code correctly. I am trying to decode hex characters to Vietnamese.
Code:
import binascii
data = 'tạm biệt'
print(data) # tạm biệt
a = binascii.hexlify(data.encode('cp1258', errors='backslashreplace'))
print(a) # b'745c75316561316d2062695c753165633774'
# if I don't use the error handler above, I get a UnicodeEncodeError for \u1ea1
print(
    binascii.unhexlify(a).decode('cp1258')  # t\u1ea1m bi\u1ec7t
)
There seems to be an omission in Python's support for code page 1258. The legacy codec does support Vietnamese by way of combining diacritics, but Python doesn't know how to convert Unicode to these combinations. I guess you will have to perform your own conversion.
As a first step, observe that unicodedata.normalize('NFD', data) splits the representation into a base character and a sequence of combining diacritics.
>>> unicodedata.normalize('NFD', data).encode('utf-8')
b'ta\xcc\xa3m bie\xcc\xa3\xcc\x82t'
>>> '{0:04x}'.format(ord(b'\xcc\xa3'.decode('utf-8')))
'0323'
So U+0323 is the combining Unicode diacritic for dot-under, and this correspondence should be known to the codec (the Wikipedia page for Windows-1258 shows the same Unicode character code for the CP1258 code point 0xF2).
I don't know enough about the target codec to tell you how to map these to CP1258, but if you are lucky, there is already some sort of mapping of these in the Python codec.
iconv on my macOS Mojave system seems to convert this without a hitch:
$ iconv -t cp1258 <<<'tạm biệt' | xxd
00000000: 7461 f26d 2062 69ea f274 0a ta.m bi..t.
From this, it looks like the diacritic applies straightforwardly as a suffix: 0x61 is a, and as noted above, 0xf2 is the combining diacritic that places a dot under the main glyph.
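Building on that observation, here is a rough sketch of a manual converter (my own encode_cp1258 helper, not a standard API). It relies on the Python codec being able to encode the individual combining marks — it decodes them, so the reverse charmap should exist — and the tone-mark set below is my assumption from the Windows-1258 layout, so verify it before relying on this:
import unicodedata

# Vietnamese tone marks that CP1258 stores as separate combining bytes
# (assumed: grave, acute, tilde, hook above, dot below)
TONE_MARKS = {'\u0300', '\u0301', '\u0303', '\u0309', '\u0323'}

def encode_cp1258(text):
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode('cp1258')  # character the codec already knows
            continue
        except UnicodeEncodeError:
            pass
        # Split off the tone marks, re-compose the rest (e.g. e + circumflex -> ê)
        decomposed = unicodedata.normalize('NFD', ch)
        tones = ''.join(c for c in decomposed if c in TONE_MARKS)
        rest = ''.join(c for c in decomposed if c not in TONE_MARKS)
        out += unicodedata.normalize('NFC', rest).encode('cp1258')
        out += tones.encode('cp1258')   # each tone mark maps to one byte
    return bytes(out)

print(encode_cp1258('tạm biệt').hex())  # expected: 7461f26d206269eaf274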
If you have a working iconv, a quick and dirty workaround might be to run it as a subprocess.
import subprocess

converted = subprocess.run(
    ['iconv', '-t', 'cp1258'],
    input=data.encode('utf-8'),
    stdout=subprocess.PIPE,
).stdout
If my understanding is correct, this should really be reported as a bug in Python. It should definitely know how to round-trip between this codec and Unicode if it wants to claim that it supports it.
I figured it out. Decoding with unicode-escape does the trick.
import binascii
data = u'tạm biệt'
print(data) # tạm biệt
a = binascii.hexlify(data.encode('cp1258', errors='backslashreplace'))
print(a) # b'745c75316561316d2062695c753165633774'
# if I don't use the error handler above, I get a UnicodeEncodeError for \u1ea1
print(
    binascii.unhexlify(a).decode('unicode-escape')  # tạm biệt
)
My strings look like this: \\xec\\x88\\x98, but if I print them they look like this: \xec\x88\x98, and when I decode them they look like this: \xec\x88\x98.
If I type the string in manually as \xec\x88\x98 and then decode it, I get the value I want: 수.
If I x.decode('unicode-escape') it, the double backslashes go away, but decoding the value returned by x.decode('unicode-escape') gives me ì.
How would I go about decoding the original \\xec\\x88\\x98 so that I get the correct output?
In Python 2 you can use the 'string-escape' codec to convert '\\xec\\x88\\x98' to '\xec\x88\x98', which is the UTF-8 encoding of u'\uc218'.
Here's a short demo. Unfortunately, my terminal's font doesn't have that character, so I can't print it. Instead, I'll print its name and its representation, and I'll also convert it to a Unicode-escape sequence.
import unicodedata as ud
src = '\\xec\\x88\\x98'
print repr(src)
s = src.decode('string-escape')
print repr(s)
u = s.decode('utf8')
print ud.name(u)
print repr(u), u.encode('unicode-escape')
output
'\\xec\\x88\\x98'
'\xec\x88\x98'
HANGUL SYLLABLE SU
u'\uc218' \uc218
However, this is a "band-aid" solution. You should try to fix this problem upstream (in your Web spider) so that you receive the data as plain UTF-8 instead of that string-escaped UTF-8 that you're currently getting.
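As an aside, the 'string-escape' codec is gone in Python 3; if you ever need the same trick there, a rough equivalent (my sketch, not part of the original answer) is:
src = '\\xec\\x88\\x98'
s = src.encode('ascii').decode('unicode_escape')  # one char per \xNN escape
u = s.encode('latin-1').decode('utf-8')           # reassemble and decode the UTF-8 bytes
print(u)  # 수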
I'm wondering how to get the Unicode representation of Arabic strings like سلام in Python?
The result should be \u0633\u0644\u0627\u0645
I need that so that I can compare texts retrieved from mysql db and data stored in redis cache.
Assuming you have an actual Unicode string, you can do
# -*- coding: utf-8 -*-
s = u'سلام'
print s.encode('unicode-escape')
output
\u0633\u0644\u0627\u0645
The # -*- coding: utf-8 -*- directive merely tells the interpreter that the source code is UTF-8 encoded; it has no bearing on how the script itself handles Unicode.
If your script is reading that Arabic string from a UTF-8 encoded source, the bytes will look like this:
\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85
You can convert that to Unicode like this:
data = '\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
s = data.decode('utf8')
print s
print s.encode('unicode-escape')
output
سلام
\u0633\u0644\u0627\u0645
Of course, you do need to make sure that your terminal is set up to handle Unicode properly.
Note that
'\u0633\u0644\u0627\u0645'
is a plain (byte) string containing 24 bytes, whereas
u'\u0633\u0644\u0627\u0645'
is a Unicode string containing 4 Unicode characters.
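You can check this yourself:
>>> len('\u0633\u0644\u0627\u0645')
24
>>> len(u'\u0633\u0644\u0627\u0645')
4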
You may find this article helpful: Pragmatic Unicode, which was written by SO veteran Ned Batchelder.
Since you're using Python 2.x and starting from a byte string, you can't call encode on it directly (Python would first try to decode it as ASCII and fail). You'll need to use the unicode function to convert the string to a unicode object.
> f='سلام'
> f
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
> unicode(f, 'utf-8') # note: you need to pass the encoding parameter in or you'll
# keep having the same problem.
u'\u0633\u0644\u0627\u0645'
> print unicode(f, 'utf-8')
سلام
I'm not sure what library you're using to fetch the content, but you might be able to fetch the data as unicode initially.
> f = u'سلام'
> f
u'\u0633\u0644\u0627\u0645'
> print f.encode('unicode-escape')
\u0633\u0644\u0627\u0645
> print f
سلام
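Both routes produce the same unicode object, as a quick check shows (assuming a UTF-8 source or terminal):
> unicode('سلام', 'utf-8') == u'سلام'
True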
For Python 2.7 (note that you must pass the encoding explicitly; plain unicode(string) falls back to ASCII and raises UnicodeDecodeError here):
string = 'سلام'
new_string = unicode(string, 'utf-8')
Prepend your string with u in Python 2.x, which makes it a unicode string. Then you can call the encode method on it.
arabic_string = u'سلام'
arabic_string.encode('utf-8')
Output:
>>> arabic_string.encode('utf-8')
'\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85'
I am currently learning Python and I came across the following code:
text = raw_input()
for letter in text:
    x = [alpha_dic[letter]]
    print x
When I enter an umlaut (which is in the dictionary, by the way) it gives me an error like KeyError: '\xfc' (for ü in this case), because the umlauts are stored internally in this form! I saw some solutions involving unicode or UTF encoding, but either I am not skilled enough to apply them correctly or maybe it simply does not work that way.
You're running into multiple shortcomings of Python (2.x) at once:
raw_input() gives you raw bytes from the system, with no encoding info attached.
The default encoding for Python 2 strings is 'ascii', which cannot represent 'ü'.
The encoding of a literal in your script is either ASCII or needs to be declared in a header at the top of the file.
So if you have a simple file like this:
x = {'ü': 20, 'ä': 10}
and run it with Python, you get an error because the encoding is unknown:
SyntaxError: Non-ASCII character '\xfc' in file foo.py on line 1, but no encoding declared;
see http://python.org/dev/peps/pep-0263/ for details
This can be fixed, of course, by adding an encoding header to the file and turning the literals into unicode literals.
For example, if the encoding is CP1252 (like a German Windows GUI):
# -*- coding: cp1252 -*-
x = {u'ü': 20, u'ä':30}
print repr(x)
This prints:
{u'\xfc': 20, u'\xe4': 30}
But if you get the header wrong (e.g. write CP850 instead of CP1252, but keep the same content), it prints:
{u'\xb3': 20, u'\xf5': 30}
Totally different.
So first check that your editor settings match the encoding header in your file, otherwise all non-ascii literals will simply be wrong.
Next step is fixing raw_input(). It does what it says: it provides raw input from the console, just bytes. But an 'ü' can be represented by many different byte sequences: 0xfc in ISO-8859-1 or CP1252, 0x81 in CP850, 0xc3 0xbc in UTF-8, 0xfc 0x00 or 0x00 0xfc in UTF-16, and so on.
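You can see the differences for yourself in a Python 2 session:
>>> u'\xfc'.encode('latin-1')
'\xfc'
>>> u'\xfc'.encode('cp850')
'\x81'
>>> u'\xfc'.encode('utf-8')
'\xc3\xbc'
>>> u'\xfc'.encode('utf-16-le')
'\xfc\x00'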
So your code has two issues with that:
for letter in text:
If text happens to be a plain byte string in a multi-byte encoding (e.g. UTF-8, UTF-16, and others), one byte is not one letter, so iterating over the string like that will not do what you expect. For a very simplified notion of "letter" you can iterate that way over Python unicode strings (if properly normalized). So you first need to make sure text is a unicode string.
How do you convert a byte string to unicode? A byte string offers the decode() method, which takes an encoding. A good first guess for that encoding is sys.stdin.encoding or locale.getpreferredencoding(True).
Putting things together:
import sys, locale

alpha_dict = {u'\xfc': u'small umlaut u'}
text = raw_input()
# turn text into unicode
utext = text.decode(sys.stdin.encoding or locale.getpreferredencoding(True))
# iterate over the unicode string - not really letters...
for letter in utext:
    x = [alpha_dict[letter]]
    print x
I got this to work borrowing from this answer:
# -*- coding: utf-8 -*-
import sys, locale
alpha_dict = {u"ü":"umlaut"}
text= raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))
for letter in text:
x=[alpha_dict[unicode(letter)]]
print x
>>> ü
>>> ['umlaut']
Python 2 and Unicode are not for the faint of heart...
With ord(ch) you can get a numerical code for character ch up to 127. Is there any function that returns a number in the range 0-255, so as to also cover ISO 8859-1 characters?
Edit: here follows the latest version of my code and the error I get:
#!/usr/bin/python
# coding: iso-8859-1
import sys
reload(sys)
sys.setdefaultencoding('iso-8859-1')
print sys.getdefaultencoding() # prints "iso-8859-1"
def char_code(c):
    return ord(c.encode('iso-8859-1'))

print char_code(u'à')
I get an error:
TypeError: ord() expected a character, but string of length 2 found
When you're starting with a Unicode string, you need to encode rather than decode.
>>> def char_code(c):
...     return ord(c.encode('iso-8859-1'))
...
>>> print char_code(u'à')
224
For ISO-8859-1 in particular, you don't even need to encode it at all, since Unicode uses the ISO-8859-1 characters for its first 256 code points.
>>> print ord(u'à')
224
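A quick sanity check confirms that the first 256 code points round-trip through ISO-8859-1 unchanged:
>>> all(ord(unichr(i).encode('latin-1')) == i for i in range(256))
True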
Edit: I see the problem now. You've given a source code encoding comment that indicates the source is in ISO-8859-1. However, I'll bet that your editor is actually working in UTF-8. The source code will be mis-interpreted, and the single-character string you think you created will actually be two characters. Try the following to see:
print len(u'à')
If your encoding is correct, it will return 1, but in your case it's probably 2.
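To make the mix-up concrete, here is what happens when the UTF-8 bytes for à are read back under the wrong header (a sketch; the exact bytes depend on your editor):
>>> s = u'à'.encode('utf-8')      # the two bytes your editor actually saved
>>> s
'\xc3\xa0'
>>> mis = s.decode('iso-8859-1')  # what Python builds from them under the wrong header
>>> len(mis)
2
>>> mis.encode('iso-8859-1')      # so ord() receives a string of length 2
'\xc3\xa0'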
You can get ord() for anything. As you might expect, ord(u'💩') works fine, provided you can represent the character properly in your source, and/or read it in a known encoding.
Your error message vaguely suggests that coding: iso-8859-1 is not actually true, and the file's encoding is actually something else (UTF-8 or UTF-16 would be my guess).
The canonical must-read on character encoding in Python is http://nedbatchelder.com/text/unipain.html
You can still use ord(), but you have to decode the byte string first.
Like this:
def char_code(c):
    return ord(c.decode('iso-8859-1'))
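Note that this variant expects a byte string rather than a unicode one, so usage would look like:
>>> print char_code('\xe0')  # the ISO-8859-1 byte for à
224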