I'm making a virtual machine in RPython using PyPy. When I tried to add Unicode support, I ran into an unusual problem. I'll use the letter "á" in my examples.
# The char in the example is á
print len(char)
OUTPUT:
2
I understand that the letter "á" takes two bytes, hence the length of 2. But when I run the example below, I am faced with the real problem.
# In this example instr = "á" (including the quotes)
for char in instr:
    print hex(int(ord(char)))
OUTPUT:
0x22
0xc3
0xa1
0x22
As you can see, there are 4 numbers. The 0x22 values are the quotes, but while there is only one letter between the quotes, it produces two numbers. My question is this: some machines I tested this script on produced this output instead:
OUTPUT:
0x22
0xe1
0x22
Is there any way to make the output the same on both machines? The script is exactly the same on each.
The program is not being given the same input on the two machines:
In [154]: '\xe1'.decode('cp1252').encode('utf_8') == '\xc3\xa1'
Out[154]: True
When you type á in a console, you may see the glyph á, but the console is translating that into bytes. The particular bytes it translates that into depend on the encoding used by the console. On a Windows machine, that may be cp1252, while on a Unix machine it is likely to be utf-8.
So you may see the input as the same, but the console (and thus the program) receives different input.
If your program decodes the bytes with the appropriate encoding and then works with unicode, both programs will behave the same from that point on. If you are receiving the bytes from sys.stdin, then sys.stdin.encoding will be the encoding Python detects the console is using.
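For illustration, here is a hedged sketch in ordinary Python 2 (the RPython VM may need its own decoding logic): read the raw bytes from stdin and decode them with whatever encoding Python detected for the console, so both machines end up with the same unicode text.
import sys
raw = sys.stdin.readline()                  # byte string; which bytes arrive depends on the console
encoding = sys.stdin.encoding or 'utf-8'    # fall back if the encoding cannot be detected
text = raw.decode(encoding)                 # a unicode object, identical on both machines
for char in text:
    print hex(ord(char))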
You have this question tagged "Python-3.x" -- is it possible that some machines are running Python 2.x, and others are running Python 3.x?
The character á is in fact U+00E1, so on a Python 3.x system, I would expect to see your second output. Since strings are Unicode in Python3 by default, len(char) will be 3 (including the quotes).
In Python 2.x, that same character in a string will be two bytes long, and (depending on your input method) will be represented in UTF-8 as \xc3\xa1. On that system, len(char) will be 4, and you would see your first output.
The issue is that you are using bytestrings to work with text data. You should use Unicode instead.
It implies that you need to know the character encoding of your input data -- There Ain't No Such Thing As Plain Text.
If you know the character encoding then it is easy to convert a bytestring to Unicode e.g.:
unicode_text = bytestring.decode(encoding)
It should resolve your initial issue.
There are also Unicode normalization forms e.g.:
import unicodedata
norm_text = unicodedata.normalize('NFC', unicode_text)
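For example, a short Python 2 sketch showing why normalization matters: the decomposed and precomposed spellings of á only compare equal after both are normalized to the same form.
import unicodedata
decomposed = u'a\u0301'      # 'a' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = u'\xe1'        # U+00E1 LATIN SMALL LETTER A WITH ACUTE
print decomposed == precomposed                                 # False
print unicodedata.normalize('NFC', decomposed) == precomposed   # True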
If I don't change the encoding in the program, how can I output unicode characters, for example?
You might mean that you have a sequence of bytes, e.g. '\xc3\xa1' (two bytes), that can be interpreted as text using some character encoding, e.g. it is the Unicode code point U+00E1 when interpreted as UTF-8. It may be something different in a different character encoding. Please read the link I've provided above: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Unless, by accident, your terminal uses the same character encoding as the data in your input file, you need to be able to convert from one character encoding to another. Otherwise the output will be corrupted, e.g. instead of á you might get ├б on the screen.
In ordinary Python, you could use bytes.decode, unicode.encode methods (or codecs module directly). I don't know whether it is possible in RPython.
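For illustration, a minimal sketch of such a conversion in ordinary CPython (not RPython); the byte values are the UTF-8 ones from the question.
import codecs
data = '\xc3\xa1'                        # bytes: UTF-8 for á
text = data.decode('utf-8')              # unicode object u'\xe1'
cp1252_bytes = text.encode('cp1252')     # '\xe1' -- the same text, different bytes
assert codecs.decode(data, 'utf-8') == text   # the codecs module offers the same operations as functions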
Related
Trying to understand encoding/decoding/unicode business in Python2.7 with vim.
I have a unicode string us to which I assign the unicode literal u'é'.
Question 1
How is us represented in memory? Is it a sequence of 32-bit integers holding the Unicode code points that \u escapes denote? Or is it kept in memory as a sequence of 8-bit bytes (\x values) in some default encoding?
Question 2
I see four different ways to set an encoding for the unicode string us: #1 at the beginning of the test.py file; #2 as an argument of the encode function; #3 as a setting in vim; #4 as the locale encoding of the system. So, what do these four encodings (#1, #2, #3, #4) do?
$ vim test.py
_____________
#encoding: #1
us=u'é'
print us.encode(encoding='#2')
_____________
:set encoding=#3
$ locale | grep LANG
LANG=en_US.#4
LANGUAGE=
In Python 2.x, unicode strings are stored internally as either UCS-2 or UCS-4, depending on the options used when building the interpreter.
#1: The source encoding as far as Python is concerned.
#2: The encoding used to encode us as bytes when the code is executed.
#3: The source encoding as far as vim is concerned. If this doesn't match #1, expect trouble.
#4: The system encoding. Mostly affects filesystem and terminal output operations.
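For instance, a small sketch of how #1 and #2 interact, assuming the test.py file really is saved as UTF-8 bytes on disk:
# -*- coding: utf-8 -*-       # 1: how Python decodes this source file
us = u'é'                     # one code point, U+00E9
print us.encode('utf-8')      # 2: writes the two bytes '\xc3\xa9'
print us.encode('latin-1')    # 2: writes the single byte '\xe9'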
Question 1 - Storage
us = u'é'
This creates a Unicode string containing the single character é. In Python 2.2+, Unicode characters are stored as UCS-2 or UCS-4, which use 2- or 4-byte unsigned integers, depending on a build-time option.
Python 3.3+ uses a flexible internal representation (PEP 393): the storage of a Unicode string depends on the highest code point in
the string:
pure ASCII and Latin-1 strings (highest code point in U+0000-U+00FF) use 1 byte per code point;
other BMP strings (highest code point in U+0100-U+FFFF) use 2 bytes per code point;
strings containing code points from the other planes (U+10000-U+10FFFF) use 4 bytes per code point.
(The byte patterns 0xxxxxxx, 110xxxxx 10xxxxxx, 1110xxxx 10xxxxxx 10xxxxxx and 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx belong to the UTF-8 encoding, which uses 1 to 4 bytes per character when the text is encoded for output, not to the in-memory storage.)
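A minimal sketch of that flexible representation, assuming CPython 3.3+ (the absolute sizes vary by version, but the growth per extra character illustrates the 1/2/4-byte storage):
import sys
for ch in ['a', '\xe9', '\u20ac', '\U0001F600']:
    one = sys.getsizeof(ch)
    many = sys.getsizeof(ch * 101)
    print(ascii(ch), (many - one) // 100)   # roughly 1, 1, 2 and 4 bytes per extra character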
Question 2 - Encoding
us=u'é'
Declares us to be a Unicode string, stored as described above. Note that in Python 3 all strings are Unicode by default, so the u prefix can be omitted.
print(us.encode('utf-8', 'strict'))  # encoding='#2'; 'ascii' would raise UnicodeEncodeError for é
Tells print how to translate the Unicode string for output. Note that if you are using Python 3.3+ and a Unicode-capable terminal/console, you probably never need to do this.
:set encoding=#3
Tells vim, emacs, and a number of other editors which encoding to use when displaying and/or editing the file; this applies to all text files, not just Python.
$ locale | grep LANG
LANG=en_US.#4
Is an operating system locale setting that tells it how to display various things, specifically which code page to use when displaying extended ASCII characters.
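A quick, hedged way to inspect #4 (and the encoding print will actually use) from inside Python:
import locale
import sys
print locale.getpreferredencoding()   # usually derived from LANG, e.g. 'UTF-8'
print sys.stdout.encoding             # what print uses for unicode written to the terminal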
This doesn't actually answer the question but I'm hoping it gives some more insight into this problem.
Answer to question 1: it shouldn't matter to the programmer how Unicode strings are represented internally in Python.
To question 2:
All the programmer should care about is that the encoding requirements of the data source and the data sink are known and correctly specified. I would assume that Python can correctly interpret UTF-encoded files by reading the BOM, and maybe even by making educated guesses, but without the BOM it can be ambiguous how to handle bytes with the high bit set, so it's advisable either to make sure the BOM is there or to tell Python explicitly that the file is UTF-8 encoded if you're not sure.
There's a difference between "Unicode" and "UTF" that seems to be glossed-over above; "UTF" specifies the representation in storage (disk, memory, network packet) but "Unicode" is simply the fact that each character has a single value (code point) that ranges from 0 to 0x10FFFF. The various flavors of UTF encode that value into the appropriate storage. Working with encoded strings can be annoying though (as the character width is variable) so when strings are actually represented in memory often it's easier to expand them into some format that allows for easy manipulation. (This is touched on in a comment on another answer.)
If you want a Unicode string in Python pre-3, just type u'<whatever>' and in 3+ type '<whatever>'. You'll get Unicode and you can use \uXXXX and \UXXXXXXXX escapes if it's infeasible to just type the characters in directly. When you want to write the data, specify the encoding. UTF-8 is often the easiest to deal with and seems to be the most commonly used but you may have reason to use a UTF-16 flavor.
The takeaway here is that the encoding is just a way to transform Unicode data so that it can be persisted. The various flavors of UTF are just the encodings, they are not actually Unicode.
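For example, a minimal sketch of persisting the same Unicode data in two UTF flavors; io.open works the same way in Python 2.6+ and 3, and the file names here are just placeholders.
import io
text = u'caf\u00e9 \u20ac'                             # Unicode code points only, no encoding yet
with io.open('out_utf8.txt', 'w', encoding='utf-8') as f:
    f.write(text)                                      # stored as 1-3 bytes per code point
with io.open('out_utf16.txt', 'w', encoding='utf-16') as f:
    f.write(text)                                      # stored as 2 bytes per code point, plus a BOM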
I use python 2.7.10.
While dealing with character encoding, and after reading a lot of Stack Overflow posts and the like on the subject, I encountered this behaviour, which looks strange to me. The Python interpreter input
>>>u'\u00b0'
results in the following output:
u'\xb0'
I could reproduce this behaviour using a DOS window, the IDLE console, and the Wing IDE Python shell.
My assumptions (correct me if I am wrong):
The "degree symbol" has unicode 0x00b0, utf-8 code 0xc2b0, latin-1 code 0xb0.
The Python docs say that a string literal with the u prefix is stored as Unicode.
Question: Why is the result shown as a unicode string literal with a byte escape sequence that matches the Latin-1 encoding, instead of preserving the Unicode escape sequence?
Thanks in advance for any help.
Python uses some rules for determining what to output from repr for each character. The rule for Unicode character codepoints in the 0x0080 to 0x00ff range is to use the sequence \xdd where dd is the hex code, at least in Python 2. There's no way to change it. In Python 3, all printable characters will be displayed without converting to a hex code.
As for why it looks like Latin-1 encoding, it's because Unicode started with Latin-1 as the base. All the codepoints up to 0xff match their Latin-1 counterpart.
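For illustration, a short Python 2 session (the value is identical either way; only the repr differs, and the last line assumes a terminal that can display the glyph):
>>> u'\u00b0' == u'\xb0'    # the same single code point, U+00B0
True
>>> u'\u00b0'               # repr escapes code points 0x80-0xff as \xdd
u'\xb0'
>>> print u'\u00b0'
°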
When I parse this XML with p = xml.parsers.expat.ParserCreate():
<name>Fortuna Düsseldorf</name>
The character parsing event handler includes u'\xfc'.
How can u'\xfc' be turned into u'ü'?
This is the main question in this post; the rest just shows further (ranting) thoughts about it.
Isn't Python unicode broken, since u'\xfc' should yield u'ü' and nothing else?
u'\xfc' is already a unicode string, so converting it to unicode again doesn't work!
Converting it to ASCII as well doesn't work.
The only thing that I found that works is this (and it cannot be intended, right?):
exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')
Replacing 8859 with utf-8 fails! What is the point of that?
Also, what is the point of the Python unicode HOWTO? It only gives examples of failures instead of showing the conversions that people (especially the hundreds who ask similar questions here) actually use in real-world practice.
Unicode is no magic - why do so many people here have issues?
The underlying problem of unicode conversion is dirt simple:
One bidirectional lookup table '\xFC' <-> u'ü'
unicode( 'Fortuna D\xfcsseldorf' )
What is the reason why the creators of Python think it is better to show an error instead of simply producing this: u'Fortuna Düsseldorf'?
Also, why did they make it not reversible?:
>>> u'Fortuna Düsseldorf'.encode('utf-8')
'Fortuna D\xc3\xbcsseldorf'
>>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
u'Fortuna D\xfcsseldorf'
You already have the value. Python simply tries to make debugging easier by giving you a representation that is ASCII friendly. Echoing values in the interpreter gives you the result of calling repr() on the result.
In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worry about how other systems might handle non-ASCII codepoints. As such the Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh and \uhhhh escape sequences. Pasting those strings back into a Python string or interactive Python session will reproduce the exact same value.
As such, ü has been replaced by \xfc, because 0xFC is the code point of U+00FC LATIN SMALL LETTER U WITH DIAERESIS.
If your terminal is configured correctly, you can just use print and Python will encode the Unicode value to your terminal codec, resulting in your terminal display giving you the non-ASCII glyphs:
>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf
If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:
>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
The alternative is for you to upgrade to Python 3; there repr() only uses escape sequences for codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, etc.; if the codepoint is not a space but falls in a C* or Z* general category, it is escaped). The new ascii() function still gives you the Python 2 repr() behaviour.
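For instance, a short sketch of that Python 3 behaviour:
>>> 'Fortuna Düsseldorf'           # Python 3 repr keeps printable non-ASCII glyphs
'Fortuna Düsseldorf'
>>> ascii('Fortuna Düsseldorf')    # ascii() reproduces the Python 2 style escapes
"'Fortuna D\\xfcsseldorf'"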
FOR PYTHON 2.7 (I took a shot at using encode in 3 and am all confused now... would love some advice on how to replicate this test in Python 3....)
For the Euro character (€) I looked up what its utf8 Hex code point was using this tool. It said it was 0x20AC.
For Latin-1 (again using Python 2.7), I used decode to get its hex code point:
>>> import unicodedata
>>> p = '€'
>>> # notably \x80 seems to correspond to Windows CP1252, according to the link
>>> p.decode('latin-1')
u'\x80'
Then I used this print statement for both of them, and this is what I got:
for utf8:
>>> print unichr(0x20AC).encode('utf-8')
â‚¬
for latin-1:
>>> print unichr(0x80).encode('latin-1')
€
What in the heck happened? Why did encode return 'â‚¬' for utf-8? Also... it seems that Latin-1 hex code points CAN be different from their UTF-8 counterparts (I have a colleague who believes differently -- says that Latin-1 is just like ASCII in this respect). But the presence of different code points seems to suggest otherwise to me... HOWEVER, the reason why Python 2.7 is reading the Windows CP1252 '\x80' is a real mystery to me... is this the standard for Latin-1 in Python 2.7??
You've got some serious misunderstandings here. If you haven't read the Unicode HOWTOs for Python 2 and Python 3, you should start there.
First, UTF-8 is an encoding of Unicode to 8-bit bytes. There is no such thing as UTF-8 code point 0x20AC. There is a Unicode code point U+20AC, but in UTF-8, that's three bytes: 0xE2, 0x82, 0xAC.
And that explains your confusion here:
Why did encode return 'â‚¬' for utf-8?
It didn't. It returned the byte string '\xE2\x82\xAC'. You then printed that out to your console. Your console is presumably in CP-1252, so it interpreted those bytes as if they were CP-1252, which gave you â‚¬.
Meanwhile, when you write this:
p='€'
The console isn't giving Python Unicode, it's giving Python bytes in CP-1252, which Python just stores as bytes. The CP-1252 for the Euro sign is \x80. So, this is the same as typing:
p='\x80'
But in Latin-1, \x80 isn't the Euro sign, it's an invisible control character, equivalent to Unicode U+0080. So, when you call p.decode('latin-1'), you get back u'\x80'. Which is exactly what you're seeing.
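The difference is easy to see by decoding the same byte with both codecs (a short Python 2 sketch):
>>> '\x80'.decode('cp1252')    # CP-1252 maps 0x80 to the Euro sign
u'\u20ac'
>>> '\x80'.decode('latin-1')   # Latin-1 maps 0x80 to the control character U+0080
u'\x80'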
The reason you can't reproduce this in Python 3 is that in Python 3, str, and plain string literals, are Unicode strings, not byte strings. So, when you write this:
p='€'
… the console gives Python some bytes, which Python then automatically decodes with the character set it guessed for the console (CP-1252) into Unicode. So, it's equivalent to writing this:
p='\u20ac'
… or this:
p=b'\x80'.decode(sys.stdin.encoding)
Also, you keep saying "hex code points" to mean a variety of different things, none of which make any sense.
A code point is a Unicode concept. A unicode string in Python is a sequence of code points. A str is a sequence of bytes, not code points. Hex is just a way of representing a number—the hex number 20AC, or 0x20AC, is the same thing as the decimal number 8364, and the hex number 0x80 is the same thing as the decimal number 128.
That sequence of bytes doesn't have any inherent meaning as text on its own; it needs to be combined with an encoding to have a meaning. Depending on the encoding, some code points may not be representable at all, and others may take 2 or more bytes to represent.
Finally:
Also... it seems that Latin-1 hex code points CAN be different from their UTF-8 counterparts (I have a colleague who believes differently -- says that Latin-1 is just like ASCII in this respect).
Latin-1 is a superset of ASCII, and Unicode is a superset of the printable subset of Latin-1. Characters up to U+7F are encoded in UTF-8 as a single byte with the same value as the code point, but the Latin-1 characters from U+80 to U+FF take two bytes in UTF-8. CP-1252 is a different superset of the printable subset of Latin-1. Since there is no Euro sign in either ASCII or Latin-1, it's perfectly reasonable for CP-1252 and UTF-8 to represent it differently.
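To make that concrete, a short Python 2 sketch:
>>> euro = u'\u20ac'
>>> euro.encode('utf-8')       # three bytes in UTF-8
'\xe2\x82\xac'
>>> euro.encode('cp1252')      # a single byte in CP-1252
'\x80'
>>> euro.encode('latin-1')     # raises UnicodeEncodeError: Latin-1 has no Euro sign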
In python:
u'\u3053\n'
Is it utf-16?
I'm not really aware of all the unicode/encoding stuff, but this type of thing is coming up in my dataset,
for example when I have a = u'\u3053\n':
print raises an exception, and
decoding raises an exception.
a.encode("utf-16") > '\xff\xfeS0\n\x00'
a.encode("utf-8") > '\xe3\x81\x93\n'
print a.encode("utf-8") > πüô
print a.encode("utf-16") > ■S0
What's going on here?
It's a unicode character that doesn't seem to be displayable in your terminal's encoding. print tries to encode the unicode object in the encoding of your terminal, and if this can't be done you get an exception.
On a terminal that can display utf-8 you get:
>>> print u'\u3053'
こ
Your terminal doesn't seem to be able to display UTF-8; otherwise at least the print a.encode("utf-8") line would produce the correct character.
You ask:
u'\u3053\n'
Is it utf-16?
The answer is no: it's unicode, not any specific encoding. utf-16 is an encoding.
To print a Unicode string effectively to your terminal, you need to find out what encoding that terminal is willing to accept and able to display. For example, the Terminal.app on my laptop is set to UTF-8 and with a rich font, so:
[screenshot: Terminal.app displaying the Hiragana glyph correctly] (source: aleax.it)
...the Hiragana letter displays correctly. On a Linux workstation I have a terminal program that keeps resetting to Latin-1, so it would mangle things somewhat like yours -- I can set it to utf-8, but it doesn't have a huge number of glyphs in the font, so it would display somewhat-useless placeholder glyphs instead.
Character U+3053 "HIRAGANA LETTER KO".
The \xff\xfe bit at the start of the UTF-16 binary format is the encoded byte order mark (U+FEFF), then "S0" is \x53\x30, then there's the \n from the original string (encoded as \n\x00). (Each of the characters has its bytes "reversed" as it's using little-endian UTF-16 encoding.)
The UTF-8 form represents the same Hiragana character in three bytes, with the bit pattern as documented here.
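A short Python 2 sketch making those byte sequences explicit:
>>> a = u'\u3053\n'
>>> a.encode('utf-16')        # BOM \xff\xfe, then \x53\x30 ("S0"), then \n\x00
'\xff\xfeS0\n\x00'
>>> a.encode('utf-16-le')     # the same without the byte order mark
'S0\n\x00'
>>> a.encode('utf-8')         # three bytes for U+3053, then the newline
'\xe3\x81\x93\n'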
Now, as for whether you should really have it in your data set... where is this data coming from? Is it reasonable for it to have Hiragana characters in it?
Here's the Unicode HowTo Doc for Python 2.6.2:
http://docs.python.org/howto/unicode.html
Also see the links in the Reference section of that document for other explanations, including one by Joel Spolsky.