python u'\u00b0' returns u'\xb0'. Why? - python

I use python 2.7.10.
On dealing with character encoding, and after reading a lot of stack-overflow etc. etc. on the subject, I encountered this behaviour which looks strange to me. Python interpreter input
>>>u'\u00b0'
results in the following output:
u'\xb0'
I could repeat this behaviour using a dos window, the idle console, and the wing-ide python shell.
My assumptions (correct me if I am wrong):
The "degree symbol" has unicode 0x00b0, utf-8 code 0xc2b0, latin-1 code 0xb0.
Python doc say, a string literal with u-prefix is encoded using unicode.
Question: Why is the result converted to a unicode-string-literal with a byte-escape-sequence which matches the latin-1 encoding, instead of persisting the unicode escape sequence ?
Thanks in advance for any help.

Python uses some rules for determining what to output from repr for each character. The rule for Unicode character codepoints in the 0x0080 to 0x00ff range is to use the sequence \xdd where dd is the hex code, at least in Python 2. There's no way to change it. In Python 3, all printable characters will be displayed without converting to a hex code.
As for why it looks like Latin-1 encoding, it's because Unicode started with Latin-1 as the base. All the codepoints up to 0xff match their Latin-1 counterpart.

Related

encoding issue. Replace special character

I have a dictionary that looks like this:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Now I'm trying to replace the \xf6 with ö ,
but trying .replace('\xf6', 'ö') returns an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position
0: ordinal not in range(128)
How can I fix this?
Now encoding is a mine field, and I might be off on this one - please correct me if that's the case.
From what I've gathered over the years is that Python2 assumes ASCII unless you defined a encoding at the top of your script. Mainly because either it's compiled that way or the OS/Terminal uses ASCII as it's primary encoding.
With that said, what you see in your example data:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Is the ASCII representation of a unicode string. Some how Python needs to tell you there's an ö in there - but it can't with ASCII because ö has no representation in the ASCII table.
But when you try to replace it using:
x.replace('\xf6', 'ö')
You're trying to find a ASCII character/string called \xf6 that is outside of the accepted bytes ranges of ASCII, so that will raise an exception. And you're trying to replace it with another invalid ASCII character and that will cause the same exception.
Hence why you get the "'ascii' codec can't decode byte...' message.
You can do unicode replacements like this:
a = u'Ganztags ge\xf6ffnet'
a.replace(u'\xf6', u'ö')
This will tell Python to find a unicode string, and replace it with another unicode string.
But the output data will result in the same thing in the example above, because \xf6 is ö in unicode.
What you want to do, is encode your string into something you want to use, for instance - UTF-8:
a.encode('UTF-8')
'Ganztags ge\xc3\xb6ffnet'
And define UTF-8 as your primary encoding by placing this at the top of your code:
#!/usr/bin/python
# -*- coding: UTF-8
This should in theory make your application a little easier to work with.
And you can from then on work with UTF-8 as your base model.
But there's no way that I know of, to convert your representation into a ASCII ö, because there really isn't such a thing. There's just different ways Python will do this encoding magic for you to make you believe it's possible to "just write ö".
In Python3 most of the strings you encounter will either be bytes data or treated a bit differently from Python2. And for the most part it's a lot easier.
There's numerous ways to change the encoding that is not part of the standard praxis. But there are ways to do it.
The closest to "good" praxis, would be the locale:
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')
I also had a horrendous solution and approach to this years back, it looked something like this (it was a great bodge for me at the time):
Python - Encoding string - Swedish Letters
tl;dr:
Your code usually assume/use ASCII as it's encoder/decoder.
ö is not a part of ASCII, there for you'll always see \xf6 if you've some how gotten unicode characters. Normally, if you print u'Ganztags ge\xf6ffnet' it will be shown as a Ö because of automatic encoding, if you need to verify if input matches that string, you have to compare them u'ö' == u'ö', if other systems depend on this data, encode it with something they understand .encode('UTF-8'). But replacing \xf6 with ö is the same thing, just that ö doesn't exist in ASCII and you need to do u'ö' - which, will result in the same data at the end.
As you are using German language, you should be aware of non ascii characters. You know whether your system prefers Latin1 (Windows console and some Unixes), UTF8 (most Linux variants), or native unicode (Windows GUI).
If you can process everything as native unicode things are cleaner and you should just accept the fact that u'ö' and u'\xf6' are the same character - the latter is simply independant of the python source file charset.
If you have to output byte strings of store them in files, you should encode them in UTF8 (can process any unicode character but characters of code above 127 use more than 1 byte) or Latin1 (one byte per character, but only supports unicode code point below 256)
In that case just use an explicit encoding to convert your unicode strings to byte strings:
print u'Ganztags ge\xf6ffnet'.encode('Latin1') # or .encode('utf8')
should give what you expect.

convert Unicode to normal string [duplicate]

When I parse this XML with p = xml.parsers.expat.ParserCreate():
<name>Fortuna Düsseldorf</name>
The character parsing event handler includes u'\xfc'.
How can u'\xfc' be turned into u'ü'?
This is the main question in this post, the rest just shows further (ranting) thoughts about it
Isn't Python unicode broken since u'\xfc' shall yield u'ü' and nothing else?
u'\xfc' is already a unicode string, so converting it to unicode again doesn't work!
Converting it to ASCII as well doesn't work.
The only thing that I found works is: (This cannot be intended, right?)
exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')
Replacing 8859 with utf-8 fails! What is the point of that?
Also what is the point of the Python unicode HOWTO? - it only gives examples of fails instead of showing how to do the conversions one (especially the houndreds of ppl who ask similar questions here) actually use in real world practice.
Unicode is no magic - why do so many ppl here have issues?
The underlying problem of unicode conversion is dirt simple:
One bidirectional lookup table '\xFC' <-> u'ü'
unicode( 'Fortuna D\xfcsseldorf' )
What is the reason why the creators of Python think it is better to show an error instead of simply producing this: u'Fortuna Düsseldorf'?
Also why did they made it not reversible?:
>>> u'Fortuna Düsseldorf'.encode('utf-8')
'Fortuna D\xc3\xbcsseldorf'
>>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
u'Fortuna D\xfcsseldorf'
You already have the value. Python simply tries to make debugging easier by giving you a representation that is ASCII friendly. Echoing values in the interpreter gives you the result of calling repr() on the result.
In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worry about how other systems might handle non-ASCII codepoints. As such the Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh and \uhhhh escape sequences. Pasting those strings back into a Python string or interactive Python session will reproduce the exact same value.
As such ü has been replaced by \xfc, because that's the Unicode codepoint for the U+00FC LATIN SMALL LETTER U WITH DIAERESIS codepoint.
If your terminal is configured correctly, you can just use print and Python will encode the Unicode value to your terminal codec, resulting in your terminal display giving you the non-ASCII glyphs:
>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf
If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:
>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
The alternative is for you upgrade to Python 3; there repr() only uses escape sequences for codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, etc; if the codepoint is not a space but falls in a C* or Z* general category, it is escaped). The new ascii() function gives you the Python 2 repr() behaviour still.

Python Unicode Bug

I'm making a virtual machine in RPython using PyPy. The problem is, when I tried to add unicode support I found an unusual problem. I'll use the letter "á" in my examples.
# The char in the example is á
print len(char)
OUTPUT:
2
I understand how the letter "á" takes two bytes, hence the length of 2. But the problem is when I use this example below I am faced with the problem.
# In this example instr = "á" (including the quotes)
for char in instr:
print hex(int(ord(char)))
OUTPUT:
0x22
0xc3
0xa1
0x22
As you can there are 4 numbers. For 0x22 are for the quotes, but there is only 1 letter in between the quotes but there are two numbers. My question is, some machines I tested this script on produced this output:
OUTPUT:
0x22
0xe1
0x22
Is there anyway to make the output the same on both machines? The script is exactly the same on each.
The program is not being given the same input on the two machines:
In [154]: '\xe1'.decode('cp1252').encode('utf_8') == '\xc3\xa1'
Out[154]: True
When you type á in a console, you may see the glyph á, but the console is translating that into bytes. The particular bytes it translates that into depends on the encoding used by the console. On a Windows machine, that may be cp1252, while on a Unix machine it is likely to be utf-8.
So you may see the input as the same, but the console (and thus the program) receives different input.
If your program were to decode the bytes with the appropriate encoding, and then work with unicode, then both programs will operate the same after that point. If you are receiving the bytes from sys.stdin, then sys.stdin.encoding will be the encoding Python detects the console is using.
You have this question tagged "Python-3.x" -- is it possible that some machines are running Python 2.x, and others are running Python 3.x?
The character á is in fact U+00E1, so on a Python 3.x system, I would expect to see your second output. Since strings are Unicode in Python3 by default, len(char) will be 3 (including the quotes).
In Python 2.x, that same character in a string will be two bytes long, and (depending on your input method) will be represented in UTF-8 as \xc3\xa1. On that system, len(char) will be 4, and you would see your first output.
The issue is that you use bytestrings to work with a text data. You should use Unicode instead.
It implies that you need to know the character encoding of your input data -- There Ain't No Such Thing As Plain Text.
If you know the character encoding then it is easy to convert a bytestring to Unicode e.g.:
unicode_text = bytestring.decode(encoding)
It should resolve your initial issue.
There are also Unicode normalization forms e.g.:
import unicodedata
norm_text = unicodedata.normalize('NFC', unicode_text)
If I don't change the encoding in the program how can I output unicode characters for example?
You might mean that you have a sequence of bytes e.g., '\xc3\xa1' (two bytes) that can be interpreted as text using some character encoding e.g., it is U+00E1 Unicode codepoint in utf-8. It may be something different in a different character encoding. Please, read the link I've provided above The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Unless by accident your terminal uses the same character encoding as data in your input file; you need to be able to convert from one character encoding to another. Otherwise the output will be corrupted e.g., instead of á you might get ├б on the screen.
In ordinary Python, you could use bytes.decode, unicode.encode methods (or codecs module directly). I don't know whether it is possible in RPython.

Latin1 character values not displaying the same as in utf8

FOR PYTHON 2.7 (I took a shot of using encode in 3 and am all confused now...would love some advice how to replicate this test in python 3....)
For the Euro character (€) I looked up what its utf8 Hex code point was using this tool. It said it was 0x20AC.
For Latin1 (again using Python2 2.7), I used decode to get its Hex code point:
>>import unicodedata
>>p='€'
## notably x80 seems to correspond to [Windows CP1252 according to the link][2]
>>p.decode('latin-1')
>>u'\x80'
Then I used this print statement for both of them, and this is what I got:
for utf8:
>>> print unichr(0x20AC).encode('utf-8')
€
for latin-1:
>>> print unichr(0x80).encode('latin-1')
€
What in the heck happened? Why did encode return '€' for utf-8? Also...it seems that Latin1 hex code points CAN be different then their utf8 counterparts (I have a colleague who believes different -- says that Latin1 is just like ASCII in this respect). But the presence of different code points seems to suggest otherwise to me...HOWEVER the reason why python 2.7 is reading the Windows CP1252 'x80' is a real mystery to me....is this the standard for latin-1 in python 2.7??
You've got some serious misunderstandings here. If you haven't read the Unicode HOWTOs for Python 2 and Python 3, you should start there.
First, UTF-8 is an encoding of Unicode to 8-bit bytes. There is no such thing as UTF-8 code point 0x20AC. There is a Unicode code point U+20AC, but in UTF-8, that's three bytes: 0xE2, 0x82, 0xAC.
And that explains your confusion here:
Why did encode return '€' for utf-8?
It didn't. It returned the byte string '\xE2\x82\xAC'. You then printed that out to your console. Your console is presumably in CP-1252, so it interpreted those bytes as if they were CP-1252, which gave you €.
Meanwhile, when you write this:
p='€'
The console isn't giving Python Unicode, it's giving Python bytes in CP-1252, which Python just stores as bytes. The CP-1252 for the Euro sign is \x80. So, this is the same as typing:
p='\x80'
But in Latin-1, \x80 isn't the Euro sign, it's an invisible control character, equivalent to Unicode U+0080. So, when you call p.decode('latin-1'), you get back u'\x80'. Which is exactly what you're seeing.
The reason you can't reproduce this in Python 3 is that in Python 3, str, and plain string literals, are Unicode strings, not byte strings. So, when you write this:
p='€'
… the console gives Python some bytes, which Python then automatically decodes with the character set it guessed for the console (CP-1252) into Unicode. So, it's equivalent to writing this:
p='\u20ac'
… or this:
p=b'\x80'.decode(sys.stdin.encoding)
Also, you keep saying "hex code points" to mean a variety of different things, none of which make any sense.
A code point is a Unicode concept. A unicode string in Python is a sequence of code points. A str is a sequence of bytes, not code points. Hex is just a way of representing a number—the hex number 20AC, or 0x20AC, is the same thing as the decimal number 8364, and the hex number 0x80 is the same thing as the decimal number 128.
That sequence of bytes doesn't have any inherent meaning as text on its own; it needs to be combined with an encoding to have a meaning. Depending on the encoding, some code points may not be representable at all, and others may take 2 or more bytes to represent.
Finally:
Also...it seems that Latin1 hex code points CAN be different then their utf8 counterparts (I have a colleague who believes different -- says that Latin1 is just like ASCII in this respect).
Latin-1 is a superset of ASCII. Unicode is also a superset of the printable subset of Latin-1; some of the Unicode characters up to U+FF (and all printable characters up to U+7F) are encoded in UTF-8 as the byte with the same value as the code point, but not all. CP-1252 is a different superset of the printable subset of Latin-1. Since there is no Euro sign in either ASCII or Latin-1, it's perfectly reasonable for CP-1252 and UTF-8 to represent it differently.

Unicode utf-8/utf-16 encoding in Python

In python:
u'\u3053\n'
Is it utf-16?
I'm not really aware of all the unicode/encoding stuff, but this type of thing is coming up in my dataset,
like if I have a=u'\u3053\n'.
print gives an exception and
decoding gives an exception.
a.encode("utf-16") > '\xff\xfeS0\n\x00'
a.encode("utf-8") > '\xe3\x81\x93\n'
print a.encode("utf-8") > πüô
print a.encode("utf-16") >  ■S0
What's going on here?
It's a unicode character that doesn't seem to be displayable in your terminals encoding. print tries to encode the unicode object in the encoding of your terminal and if this can't be done you get an exception.
On a terminal that can display utf-8 you get:
>>> print u'\u3053'
こ
Your terminal doesn't seem to be able to display utf-8, else at least the print a.encode("utf-8") line should produce the correct character.
You ask:
u'\u3053\n'
Is it utf-16?
The answer is no: it's unicode, not any specific encoding. utf-16 is an encoding.
To print a Unicode string effectively to your terminal, you need to find out what encoding that terminal is willing to accept and able to display. For example, the Terminal.app on my laptop is set to UTF-8 and with a rich font, so:
(source: aleax.it)
...the Hiragana letter displays correctly. On a Linux workstation I have a terminal program that keeps resetting to Latin-1 so it would mangle things somewhat like yours -- I can set it to utf-8, but it doesn't have huge number of glyphs in the font, so it would display somewhat-useless placeholder glyphs instead.
Character U+3053 "HIRAGANA LETTER KO".
The \xff\xfe bit at the start of the UTF-16 binary format is the encoded byte order mark (U+FEFF), then "S0" is \x5e\x30, then there's the \n from the original string. (Each of the characters has its bytes "reversed" as it's using little endian UTF-16 encoding.)
The UTF-8 form represents the same Hiragana character in three bytes, with the bit pattern as documented here.
Now, as for whether you should really have it in your data set... where is this data coming from? Is it reasonable for it to have Hiragana characters in it?
Here's the Unicode HowTo Doc for Python 2.6.2:
http://docs.python.org/howto/unicode.html
Also see the links in the Reference section of that document for other explanations, including one by Joel Spolsky.

Categories