Understanding unicode and encoding in Python

When I enter the following in the Python 2.7 console:
>>> 'áíóús'
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
>>> u'áíóús'
u'\xe1\xed\xf3\xfas'
I get the above output. What is the difference between the two? I understand the basics of Unicode and the different kinds of encodings like UTF-8, UTF-16, etc., but I don't understand what is being printed on the console or how to make sense of it.

u'áíóús' is a string of text. What you see echoed in the REPL is the canonical representation of that object:
>>> print u'áíóús'
áíóús
>>> print repr(u'áíóús')
u'\xe1\xed\xf3\xfas'
Escapes like \xe1 correspond to the hexadecimal ordinals of each character:
>>> [hex(ord(c)) for c in u'áíóús']
['0xe1', '0xed', '0xf3', '0xfa', '0x73']
Only the last character is in the ASCII range, i.e. ordinals in range(128), so only that last character, "s", is plainly visible in the Python 2.x repr:
>>> chr(0x73)
's'
'áíóús' is a string of bytes. What you see is an encoding of the same text, in whatever encoding your terminal emulator happens to send (here UTF-8):
>>> 'áíóús'
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
>>> u'áíóús'.encode('utf-8')
'\xc3\xa1\xc3\xad\xc3\xb3\xc3\xbas'
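To see that these are just two views of the same text, you can convert between them explicitly (a quick check, assuming a UTF-8 terminal as above):
>>> 'áíóús'.decode('utf-8') == u'áíóús'
True
>>> u'áíóús'.encode('utf-8') == 'áíóús'
True
>>> len(u'áíóús'), len('áíóús')   # 5 characters vs. 9 bytes
(5, 9)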

String has unicode code points embedded, how to convert? Python 3 [duplicate]

I'm getting back from a library what looks to be an incorrect unicode string:
>>> title
u'Sopet\xc3\xb3n'
Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. So far as I understand, a unicode string in Python should have the actual character, not the UTF-8 encoding for the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?
The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?
I'm stumped on (a), as there's nothing wrong, encoding-wise, with the original string (i.e., both \xc3 and \xb3 are valid characters in their own right; they're just not what's supposed to be there).
It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:
>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón
But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?
a) Try putting it through the method in (b) below; if it decodes cleanly, the text was almost certainly mis-decoded UTF-8.
b)
>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'
You should use:
>>> title.encode('raw_unicode_escape')
Python 2:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))
Python 3:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))
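For part (a), there is no foolproof test, but a pragmatic approach (my own helper sketch, not an official API) is to attempt the round trip and keep the original string if it fails:
def fix_mojibake(text):
    # If the unicode string is really UTF-8 bytes that were mis-decoded as
    # Latin-1, re-encoding to Latin-1 and decoding as UTF-8 recovers the text.
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not recoverable this way; probably not mojibake

>>> fix_mojibake(u'Sopet\xc3\xb3n')
u'Sopet\xf3n'
Note this is only a heuristic: genuine Latin-1 text that happens to form valid UTF-8 byte sequences would be "fixed" too.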

How to print Unicode like “u{variable}” in Python 2.7?

For example, I can print a Unicode symbol like:
print u'\u00E0'
Or
a = u'\u00E0'
print a
But it looks like I can't do something like this:
a = '\u00E0'
print someFunctionToDisplayTheCharacterRepresentedByThisCodePoint(a)
The main use case will be in loops. I have a list of Unicode code points and I wish to display them on the console. Something like:
with open("someFileWithAListOfUnicodeCodePoints") as uniCodeFile:
    for codePoint in uniCodeFile:
        print codePoint  # I want the console to display the unicode character here
The file has a list of unicode code points. For example:
2109
00B0
00E4
1F1E6
The loop should output:
℉
°
ä
🇦
Any help will be appreciated!
This is probably not a great way, but it's a start:
>>> import struct
>>> x = '00e4'
>>> print unicode(struct.pack("!I", int(x, 16)), 'utf_32_be')
ä
First, we get the integer represented by the hexadecimal string x. We pack that into a byte string, which we can then decode using the utf_32_be encoding.
Since you are doing this a lot, you can precompile the struct:
import struct

int2bytes = struct.Struct("!I").pack
with open("someFileWithAListOfUnicodeCodePoints") as fh:
    for code_point in fh:
        print unicode(int2bytes(int(code_point, 16)), 'utf_32_be')
If you think it's clearer, you can also use the decode method instead of the unicode type directly:
>>> print int2bytes(int('00e4', 16)).decode('utf_32_be')
ä
Python 3 added a to_bytes method to the int class that lets you bypass the struct module:
>>> str(int('00e4', 16).to_bytes(4, 'big'), 'utf_32_be')
"ä"
You want print unichr(int('00E0', 16)): convert the hex string to an integer, and unichr() gives you the character for that code point.
Caveat: On Windows codepoints > U+FFFF won't work.
Solution: Use Python 3.3+ print(chr(int(line,16)))
In all cases you'll still need to use a font that supports the glyphs for the codepoints.
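Applied to the file from the question, that looks roughly like this (a sketch; it assumes one bare hex code point per line, and the narrow-build caveat above still applies):
with open("someFileWithAListOfUnicodeCodePoints") as uniCodeFile:
    for codePoint in uniCodeFile:
        codePoint = codePoint.strip()
        if codePoint:
            # int(..., 16) parses the hex string; unichr() gives the character
            print unichr(int(codePoint, 16))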
These are Unicode code points, but they lack the \u Python unicode-escape prefix. So, just put it in:
with open("someFileWithAListOfUnicodeCodePoints", "rb") as uniCodeFile:
    for codePoint in uniCodeFile:
        print ("\\u" + codePoint.strip()).decode("unicode-escape")
Whether this works on a given system depends on the console's encoding. If it's a Windows code page and the characters are not in its range, you'll still get funky errors.
In Python 3 that would be b"\\u".
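A Python 3 adaptation of the same idea might look like this (my own sketch; note that \u takes exactly four hex digits, so code points above U+FFFF such as 1F1E6 need the eight-digit \U form):
with open("someFileWithAListOfUnicodeCodePoints", "rb") as uniCodeFile:
    for codePoint in uniCodeFile:
        cp = codePoint.strip()
        if not cp:
            continue
        if len(cp) <= 4:
            escape = b"\\u" + cp.rjust(4, b"0")   # e.g. b"\\u00e4"
        else:
            escape = b"\\U" + cp.rjust(8, b"0")   # e.g. b"\\U0001f1e6"
        print(escape.decode("unicode-escape"))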

convert Unicode to normal string [duplicate]

When I parse this XML with p = xml.parsers.expat.ParserCreate():
<name>Fortuna Düsseldorf</name>
The character parsing event handler includes u'\xfc'.
How can u'\xfc' be turned into u'ü'?
This is the main question in this post; the rest just shows further (ranting) thoughts about it.
Isn't Python unicode broken, since u'\xfc' should yield u'ü' and nothing else?
u'\xfc' is already a unicode string, so converting it to unicode again doesn't work!
Converting it to ASCII as well doesn't work.
The only thing that I found works is: (This cannot be intended, right?)
exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')
Replacing 8859 with utf-8 fails! What is the point of that?
Also, what is the point of the Python Unicode HOWTO? It only gives examples of failures instead of showing the conversions people (especially the hundreds of people who ask similar questions here) actually use in real-world practice.
Unicode is no magic - why do so many people here have issues?
The underlying problem of unicode conversion is dirt simple:
One bidirectional lookup table '\xFC' <-> u'ü'
unicode( 'Fortuna D\xfcsseldorf' )
What is the reason why the creators of Python think it is better to show an error instead of simply producing this: u'Fortuna Düsseldorf'?
Also, why did they make it not reversible?
>>> u'Fortuna Düsseldorf'.encode('utf-8')
'Fortuna D\xc3\xbcsseldorf'
>>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
u'Fortuna D\xfcsseldorf'
You already have the value. Python simply tries to make debugging easier by giving you an ASCII-friendly representation. Echoing a value in the interpreter gives you the result of calling repr() on it.
In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worry about how other systems might handle non-ASCII codepoints. As such the Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh and \uhhhh escape sequences. Pasting those strings back into a Python string or interactive Python session will reproduce the exact same value.
As such, ü has been replaced by \xfc, because 0xFC is the ordinal of the U+00FC LATIN SMALL LETTER U WITH DIAERESIS codepoint.
If your terminal is configured correctly, you can just use print and Python will encode the Unicode value to your terminal codec, resulting in your terminal display giving you the non-ASCII glyphs:
>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf
If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:
>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
The alternative is for you to upgrade to Python 3; there repr() only uses escape sequences for codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, etc.; if the codepoint is not a space but falls in a C* or Z* general category, it is escaped). The new ascii() function still gives you the Python 2 repr() behaviour.
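For illustration, here is the Python 3 behaviour described above, in a UTF-8 terminal (same text, no u prefix needed):
>>> s = 'Fortuna Düsseldorf'
>>> s
'Fortuna Düsseldorf'
>>> print(ascii(s))
'Fortuna D\xfcsseldorf'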

Why does this string get printed out like this?

I am playing around with string formatting, and I am actually trying to understand the following piece of code:
mystring = "\x80" * 50;
print mystring
output:
>>>
€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€
>>>
The output is one string of euro signs. But why is that? This is not ASCII, AFAIK, and the other question I am asking myself is why it does not print out the hex \x80. Thanks in advance.
As for the first question, \x80 is interpreted as \u0080. A nice explanation can be found at Bytes in a unicode Python string.
Edit:
@Joran Besley is right, so let me rephrase it:
u'\x80' is equal to u'\u0080'.
In fact:
unicode(u'\u0080')
>>> u'\x80'
and that's because Python < 3 prefers \x as the escape representation of Unicode characters when possible, that is, as long as the code point is less than 256. Above that, it uses the normal \u:
unicode(u'\u2019')
>>> u'\u2019' # curved quotes in windows-1252
Where the character is then mapped depends on your terminal encoding. As Joran said, you are probably using Windows-1252 or something close to it, where the euro symbol is the hex byte 0x80. In iso-8859-15, for example, the hex value is 0xa4:
"\xa4".decode("iso-8859-15") == "\x80".decode('windows-1252')
>>> True
If you are curious about your terminal encoding, you can get it from sys:
import sys
sys.stdin.encoding
>>> 'UTF-8' # my terminal
sys.stdout.encoding
>>> 'UTF-8' # same as above
I hope it makes up for my mistake.
A little tinkering in IDLE produced this output.
>>> a = "\x80"
>>> a
'\x80'
>>> print a * 50
€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€€
>>> print a
€
>>>
The first thing that stands out is the '\' character. This character is used for escaping characters in strings. You can learn about escaping characters in the link below.
http://en.wikipedia.org/wiki/Escape_character
Changing the string slightly tells us that escaping is occurring.
>>> print '\x8'
ValueError: invalid \x escape
What I think is happening is the escape is causing the string to be looked up in the ASCII (or similar) table.
It depends on your terminal encoding ... in the Windows terminal that encodes to a bunch of C-cedillas (Ç).
If you want to see the "\x80", you can print repr(mystring).
Furthermore, 0x80 = 128, which is not ASCII (ASCII technically only goes up to 0x7f); it is the value of the euro sign.
Specifically, that is how "Windows-1252" encodes the euro sign (apparently that's how almost all the "Windows-125x" code pages encode the euro sign).
This answer has lots more info:
Hex representation of Euro Symbol €
Furthermore, you can convert it to unicode:
unicode_ch = "\x80".decode("Windows-1252") #it is now decoded into unicode
print repr(unicode_ch) # \u20AC the unicode equivalent of Euro
print unicode_ch #as long as your terminal can handle it

Encoding used for u"" literals

Consider the following example:
>>> s = u"баба"
>>> s
u'\xe1\xe0\xe1\xe0'
>>> print s
áàáà
I'm using the cp1251 encoding within IDLE, but it seems like the interpreter actually uses latin1 to create the unicode string:
>>> print s.encode('latin1')
баба
Why so? Is there a spec for such behavior?
CPython, 2.7.
Edit
The code I was actually looking for is
>>> u'\xe1\xe0\xe1\xe0' == u'\u00e1\u00e0\u00e1\u00e0'
True
It seems like when encoding unicode with the latin1 codec, all code points less than 256 are simply left as-is, thus resulting in the bytes which I typed in before.
When you type a character such as б into the terminal, you see a б, but what is really inputted is a sequence of bytes.
Since your terminal encoding is cp1251, typing баба results in the sequence of bytes equal to the unicode баба encoded in cp1251:
In [219]: "баба".decode('utf-8').encode('cp1251')
Out[219]: '\xe1\xe0\xe1\xe0'
(Note I use utf-8 above because my terminal encoding is utf-8, not cp1251. For me, "баба".decode('utf-8') is just unicode for баба.)
Since typing баба results in the sequence of bytes \xe1\xe0\xe1\xe0, when you type u"баба" into the terminal, Python receives u'\xe1\xe0\xe1\xe0' instead. This is why you are seeing
>>> s
u'\xe1\xe0\xe1\xe0'
This unicode happens to represent áàáà.
And when you type
>>> print s.encode('latin1')
the latin1 encoding converts u'\xe1\xe0\xe1\xe0' to '\xe1\xe0\xe1\xe0'.
The terminal receives the sequence of bytes '\xe1\xe0\xe1\xe0', and decodes them with cp1251, thus printing баба:
In [222]: print('\xe1\xe0\xe1\xe0'.decode('cp1251'))
баба
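You can replay the whole chain explicitly, without relying on any particular terminal (a sketch using explicit escapes in place of typed input):
>>> real = u'\u0431\u0430\u0431\u0430'        # баба, spelled with explicit escapes
>>> typed_bytes = real.encode('cp1251')       # the bytes a cp1251 terminal sends
>>> typed_bytes
'\xe1\xe0\xe1\xe0'
>>> s = typed_bytes.decode('latin1')          # each byte taken as a code point, as the interpreter did
>>> s
u'\xe1\xe0\xe1\xe0'
>>> s.encode('latin1').decode('cp1251') == real   # undo both steps
True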
Try:
>>> s = "баба"
(without the u) instead. Or,
>>> s = "баба".decode('cp1251')
to make s unicode. Or, use the verbose but very explicit (and terminal-encoding agnostic):
>>> s = u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}'
Or the short but less readily comprehensible:
>>> s = u'\u0431\u0430\u0431\u0430'
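All of these spellings produce the same unicode value; a quick check that does not depend on the terminal encoding:
>>> '\xe1\xe0\xe1\xe0'.decode('cp1251') == u'\u0431\u0430\u0431\u0430'
True
>>> u'\N{CYRILLIC SMALL LETTER BE}\N{CYRILLIC SMALL LETTER A}' == u'\u0431\u0430'
True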
