Python 3 print utf-8 encoded string problem

I'm requesting a string from a network-service. When I print it from within a program:
variable = getFromNetwork()
print(variable)
and when I execute it with python3 net.py, I get:
\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612
When I execute it in the Python 3 CLI:
>>> print("\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612")
تÙ
Ù
Ù612
But when I execute it in the Python 2 CLI I get the correct result:
>>> print("\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612")
تملي612
How can I print this correctly in my program with Python 3?
Edit
After executing the following line:
print(type(variable), repr(variable))
I got:
<class 'str'> '\\xd8\\xaa\\xd9\\x85\\xd9\\x84\\xd9\\x8a612'
I think I should first remove the \\x to make it hex and then decode it. What are your solutions?

You need to specify the encoding, so the interpreter knows how to interpret the data:
s = "\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612"
y = s.encode('raw_unicode_escape')
print(y)  # is a bytes object now!
print(y.decode('utf-8'))
Out:
b'\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612'
تملي612

Your variable is a (unicode) string that contains the code points of a UTF-8 encoded byte string. This can happen when the data was erroneously decoded with the wrong encoding (probably Latin-1 here).
You can fix it by first converting it back to a byte string without changing the code points (so with the Latin-1 encoding), and then decoding it correctly:
variable = getFromNetwork().encode('Latin1').decode()
print(variable)
Demo:
variable = "\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612"
print(variable.encode('Latin1').decode())
تملي612

In Python 3, I tested with the following code:
line = '\xd8\xaa\xd9\x85\xd9\x84\xd9\x8a612'
line = line.encode('raw_unicode_escape')
line = line.decode('utf-8')
print(line)
It prints:
تملي612

Related

Decoding Python Unicode strings that contain double backslashes

My strings look like this \\xec\\x88\\x98, but if I print them they look like this \xec\x88\x98, and when I decode them they look like this \xec\x88\x98
If I type the string in manually as \xec\x88\x98 and then decode it, I get the value I want 수.
If I call x.decode('unicode-escape'), it removes the double slashes, but when I decode the value returned by x.decode('unicode-escape'), the value I get is ì.
How would I go about decoding the original \\xec\\x88\\x98 so that I get the correct output?
In Python 2 you can use the 'string-escape' codec to convert '\\xec\\x88\\x98' to '\xec\x88\x98', which is the UTF-8 encoding of u'\uc218'.
Here's a short demo. Unfortunately, my terminal's font doesn't have that character, so I can't print it. Instead I'll print its name and its representation, and I'll also convert it to a Unicode-escape sequence.
import unicodedata as ud
src = '\\xec\\x88\\x98'
print repr(src)
s = src.decode('string-escape')
print repr(s)
u = s.decode('utf8')
print ud.name(u)
print repr(u), u.encode('unicode-escape')
Output:
'\\xec\\x88\\x98'
'\xec\x88\x98'
HANGUL SYLLABLE SU
u'\uc218' \uc218
However, this is a "band-aid" solution. You should try to fix this problem upstream (in your Web spider) so that you receive the data as plain UTF-8 instead of that string-escaped UTF-8 that you're currently getting.
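In Python 3 the 'string-escape' codec no longer exists; a rough equivalent of the same band-aid (just a sketch, assuming you receive the same string-escaped UTF-8 text) goes through 'unicode_escape' instead:
src = '\\xec\\x88\\x98'                              # literal backslashes, as received
s = src.encode('ascii').decode('unicode_escape')     # now '\xec\x88\x98' as code points
u = s.encode('latin-1').decode('utf-8')              # reinterpret those bytes as UTF-8
print(u)                                             # 수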

I will never know the decoding and encoding system of python2.7

Sorry for asking a question like a fool, but maybe someone could help me get out of the decode/encode hell of Python 2.7.
I have a string as below. I'm not sure, but I think it's encoded as UTF-8 because I wrote # -*- coding: utf-8 -*- at the head of the .py file:
s = "今日もしないとね"
and from my point of view, if it's a string, part of it could be printed out by using [] like this:
print s[1]
Then I got an error in my Sublime Text:
[Decode error - output not utf-8]
When I tried it in the terminal I got a
?
Okay, maybe a part of a UTF-8 string is no longer a valid UTF-8 string, so I tried:
print s[1].encode("utf-8")
then I got this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbb in position 0: ordinal not in range(128)
I was totally confused. Does it mean that part of the string is ASCII, like \xbb?
Could anybody tell me what are the encoding of the following stuff?
a = "今日もしないとね"
b = u"今日もしないとね"
c = "python2.7 fxxked me"
d = u"python2.7 fxxked me"
e = "今"
f = "z"
aa = a[0]
bb = b[0]
cc = c[0]
dd = d[0]
And how can I get "今日" from "今日,もしないとね"?
Thank you!
Your file is correctly encoded in UTF-8 but your operating system doesn't (directly) support Unicode on output.
The right way to specify a Unicode string literal in Python 2 is with the u prefix; only then is an actual Unicode string stored.
By the way, you can see what Python actually thinks about your variable content using the repr function:
>>> print repr(a)
'\xe4\xbb\x8a\xe6\x97\xa5\xe3\x82\x82\xe3\x81\x97\xe3\x81\xaa\xe3\x81\x84\xe3\x81\xa8\xe3\x81\xad'
>>> print repr(b)
u'\u4eca\u65e5\u3082\u3057\u306a\u3044\u3068\u306d'
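To illustrate the indexing part of the question (just a sketch, assuming Python 2 and a UTF-8 terminal): indexing a byte string gives you a single byte, while indexing a unicode string gives you a single character.
# -*- coding: utf-8 -*-
a = "今日もしないとね"     # byte string: 24 UTF-8 bytes
b = u"今日もしないとね"    # unicode string: 8 characters
print repr(a[0])           # '\xe4'    -- just the first byte of 今
print repr(b[0])           # u'\u4eca' -- the character 今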
As the comments suggest, Unicode isn't as easy to pick up as many other parts of Python.
The following code sample will print "今日"
# -*- coding: utf-8 -*-
b = u"今日もしないとね"
print b[:2]
However, the coding line only tells Python how to interpret the bytes in the file. Many editors won't look for the coding line, and you'll need to make sure that they, too, are using UTF-8 when working out how to display those bytes to you.
When Python gets to the print statement, it will take the unicode object b and encode it using sys.stdout.encoding. This had better match your terminal/console settings, or you will get garbage printed instead.
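If you cannot control those terminal settings, one workaround (a sketch, still Python 2) is to encode explicitly instead of relying on the implicit encoding done by print:
# -*- coding: utf-8 -*-
import sys
b = u"今日もしないとね"
encoding = sys.stdout.encoding or 'utf-8'   # stdout may report no encoding when piped
print b[:2].encode(encoding)                # 今日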

Special characters appearing as question marks

Using the Python programming language, I'm having trouble outputting characters such as å, ä and ö. The following code gives me a question mark (?) as output, not an å:
#coding: iso-8859-1
input = "å"
print input
The following code lets you input random text. The for-loop goes through each character of the input, adds them to the string variable a and then outputs the resulting string. This code works correctly; you can input å, ä and ö and the output will still be correct. For example, "år" outputs "år" as expected.
#coding: iso-8859-1
input = raw_input("Test: ")
a = ""
for i in range(0, len(input)):
    a = a + input[i]
print a
What's interesting is that if I change input = raw_input("Test: ") to input = "år", it will output a question mark (?) for the "å".
#coding: iso-8859-1
input = "år"
a = ""
for i in range(0, len(input)):
    a = a + input[i]
print a
For what it's worth, I'm using TextWrangler, and my document's character encoding is set to ISO Latin 1. What causes this? How can I solve the problem?
You're using Python 2, I assume running on a platform like Linux that encodes I/O in UTF-8.
Python 2's "" literals represent byte-strings. So when you specify "år" in your ISO 8859-1-encoded source file, the variable input has the value b'\xe5r'. When you print this, the raw bytes are output to the console, but show up as a question-mark because they are not valid UTF-8.
To demonstrate, try it with print repr(a) instead of print a.
When you use raw_input(), the user's input is already UTF-8-encoded, and so the bytes are output correctly.
To fix this, either:
Decode your string from ISO-8859-1 and re-encode it as UTF-8 before printing it:
print a.decode('iso-8859-1').encode('utf-8')
Use Unicode strings (u'text') instead of byte-strings. You will need to be careful with decoding the input, since on Python 2, raw_input() returns a byte-string rather than a text string. If you know the input is UTF-8, use raw_input().decode('utf-8'). A sketch of this option is shown after the list.
Encode your source file in UTF-8 instead of iso-8859-1. Then the byte-string literal will already be in UTF-8.
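A minimal sketch of the second option, assuming Python 2, a UTF-8 terminal, and the file still saved as ISO Latin 1 as in the question:
# -*- coding: iso-8859-1 -*-
import sys
a = u"år"                                    # unicode literal, decoded via the coding line
user = raw_input("Test: ")                   # raw_input() returns a byte-string
user = user.decode(sys.stdin.encoding or 'utf-8')
print (a + user).encode(sys.stdout.encoding or 'utf-8')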

What type of representation is the default in Python for storing Unicode strings?

If I do this in python:
>>> name = "âțâîâ"
>>> name
'\xc3\xa2\xc8\x9b\xc3\xa2\xc3\xae\xc3\xa2'
>>> len(name)
10
>>> u = name.decode('utf-8')
>>> len (u)
5
>>>
What is the default encoding in Python if you don't specify any?
You are specifying a Python string literal, and its encoding is determined by the default settings of your editor (or, in the case of the Python interpreter, of your terminal). Python has no say in this.
By default, Python 2 tries to interpret source code as ASCII. In Python 3 this has been switched to UTF-8.
Please read the Python Unicode HOWTO to further understand the difference between Unicode and input and output encodings. You really should also read Joel Spolsky's article on Unicode.
Probably you are using Python 2. (If not, this answer does not apply.)
What happens is the following:
>>> name = "âțâîâ"
You assign to name a (byte) string whose contents are determined by the encoding of your terminal or of your text editor, respectively. In your case, this is obviously UTF-8.
These bytes are shown with
>>> name
'\xc3\xa2\xc8\x9b\xc3\xa2\xc3\xae\xc3\xa2'
Only if you decode it with
>>> u = name.decode('utf-8')
do you get a unicode string; here you specify that encoding explicitly.
A simpler and more reliable way would be to directly do
u = u"âțâîâ"
and only then extract the bytes according to your wanted encoding:
name = u.encode("utf-8")
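A quick check (a sketch, Python 2 with a UTF-8 source file) that ties this back to the lengths shown in the question:
# -*- coding: utf-8 -*-
u = u"âțâîâ"                  # 5 characters
name = u.encode("utf-8")      # 10 bytes: two per character here
assert len(u) == 5
assert len(name) == 10
assert name.decode("utf-8") == u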

Python Character Encoding

I have a Python script that retrieves information from a web service and then looks up data in a MySQL db. The data is Unicode when I receive it; however, I want the SQL statement to use the actual character (Băcioi in the example below). As you can see, when I try to encode it to UTF-8 the result is still not what I'm looking for.
>>> x = u'B\u0103cioi'
>>> x
u'B\u0103cioi'
>>> x.encode('utf-8')
'B\xc4\x83cioi'
>>> print x
Băcioi ## << What I want!
Your encoding is working fine. Python is simply showing you the repr()'d version of the string at the interactive prompt, which uses \x escapes. You can tell because it also displays the quotes around the string.
print does not mutate the string: if it prints out the characters you want, that's what is actually in the string.
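For the MySQL side, the safer route (a sketch; the MySQLdb driver and the table/column names here are assumptions, not taken from the question) is to pass the Unicode value as a query parameter and let the driver handle the encoding, rather than pasting an escaped repr into the SQL text:
# -*- coding: utf-8 -*-
import MySQLdb
conn = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                       db='places', charset='utf8', use_unicode=True)
cur = conn.cursor()
x = u'B\u0103cioi'
cur.execute("SELECT id FROM villages WHERE name = %s", (x,))  # parameterized query
print cur.fetchall()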
