Latin1 character values not displaying the same as in utf8 - python

FOR PYTHON 2.7 (I took a shot at using encode in 3 and am all confused now... would love some advice on how to replicate this test in Python 3...)
For the Euro character (€) I looked up what its utf8 hex code point was using this tool. It said it was 0x20AC.
For Latin1 (again using Python 2.7), I used decode to get its hex code point:
>>> import unicodedata
>>> p = '€'
>>> # notably \x80 seems to correspond to the Euro sign in Windows CP1252
>>> p.decode('latin-1')
u'\x80'
Then I used this print statement for both of them, and this is what I got:
for utf8:
>>> print unichr(0x20AC).encode('utf-8')
€
for latin-1:
>>> print unichr(0x80).encode('latin-1')
€
What in the heck happened? Why did encode return '€' for utf-8? Also... it seems that Latin1 hex code points CAN be different from their utf8 counterparts (I have a colleague who believes differently -- says that Latin1 is just like ASCII in this respect). But the presence of different code points seems to suggest otherwise to me... HOWEVER, the reason why python 2.7 is reading the Windows CP1252 '\x80' is a real mystery to me... is this the standard for latin-1 in python 2.7??

You've got some serious misunderstandings here. If you haven't read the Unicode HOWTOs for Python 2 and Python 3, you should start there.
First, UTF-8 is an encoding of Unicode to 8-bit bytes. There is no such thing as UTF-8 code point 0x20AC. There is a Unicode code point U+20AC, but in UTF-8, that's three bytes: 0xE2, 0x82, 0xAC.
And that explains your confusion here:
Why did encode return '€' for utf-8?
It didn't. It returned the byte string '\xE2\x82\xAC'. You then printed that out to your console. Your console is presumably in CP-1252, so it interpreted those bytes as if they were CP-1252, which gave you €.
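You can watch that round trip explicitly (a Python 3 sketch of the same steps; in Python 2 you would write unichr(0x20AC) instead of the literal):

b = '\u20ac'.encode('utf-8')
print(b)                    # b'\xe2\x82\xac' -- three bytes, no 0x20AC anywhere
print(b.decode('cp1252'))   # € -- the same three bytes misread as CP-1252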
Meanwhile, when you write this:
p='€'
The console isn't giving Python Unicode, it's giving Python bytes in CP-1252, which Python just stores as bytes. The CP-1252 for the Euro sign is \x80. So, this is the same as typing:
p='\x80'
But in Latin-1, \x80 isn't the Euro sign, it's an invisible control character, equivalent to Unicode U+0080. So, when you call p.decode('latin-1'), you get back u'\x80'. Which is exactly what you're seeing.
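You can see the two single-byte codecs disagree about 0x80 directly (Python 3 syntax, but the decodes behave the same in Python 2):

print(repr(b'\x80'.decode('latin-1')))  # '\x80' -- U+0080, an invisible control character
print(repr(b'\x80'.decode('cp1252')))   # '€' -- CP-1252 maps byte 0x80 to the Euro sign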
The reason you can't reproduce this in Python 3 is that in Python 3, str, and plain string literals, are Unicode strings, not byte strings. So, when you write this:
p='€'
… the console gives Python some bytes, which Python then automatically decodes with the character set it guessed for the console (CP-1252) into Unicode. So, it's equivalent to writing this:
p='\u20ac'
… or this:
p=b'\x80'.decode(sys.stdin.encoding)
Also, you keep saying "hex code points" to mean a variety of different things, none of which make any sense.
A code point is a Unicode concept. A unicode string in Python is a sequence of code points. A str is a sequence of bytes, not code points. Hex is just a way of representing a number—the hex number 20AC, or 0x20AC, is the same thing as the decimal number 8364, and the hex number 0x80 is the same thing as the decimal number 128.
That sequence of bytes doesn't have any inherent meaning as text on its own; it needs to be combined with an encoding to have a meaning. Depending on the encoding, some code points may not be representable at all, and others may take 2 or more bytes to represent.
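For example (a small Python 3 sketch; note that the last line raises):

s = '\u20ac'                    # one code point: the Euro sign
print(len(s.encode('utf-8')))   # 3 -- three bytes in UTF-8
print(len(s.encode('cp1252')))  # 1 -- a single byte in CP-1252
s.encode('latin-1')             # UnicodeEncodeError -- not representable at all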
Finally:
Also...it seems that Latin1 hex code points CAN be different then their utf8 counterparts (I have a colleague who believes different -- says that Latin1 is just like ASCII in this respect).
Latin-1 is a superset of ASCII. Unicode is also a superset of the printable subset of Latin-1; some of the Unicode characters up to U+FF (and all printable characters up to U+7F) are encoded in UTF-8 as the byte with the same value as the code point, but not all. CP-1252 is a different superset of the printable subset of Latin-1. Since there is no Euro sign in either ASCII or Latin-1, it's perfectly reasonable for CP-1252 and UTF-8 to represent it differently.
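A quick comparison across the three ranges (Python 3 sketch):

for byte in (b'\x41', b'\xf6', b'\x80'):
    print(byte.hex(), repr(byte.decode('latin-1')), repr(byte.decode('cp1252')))
# 41 'A' 'A'       -- ASCII range: identical in both
# f6 'ö' 'ö'       -- printable Latin-1 range: CP-1252 agrees
# 80 '\x80' '€'    -- C1 control range: CP-1252 repurposes it for the Euro sign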

Python utf-8 encoding not following unicode rules

Background: I've got a byte file that is encoded using unicode. However, I can't figure out the right method to get Python to decode it to a string. Sometimes it uses 1-byte ASCII text. The majority of the time it uses 2-byte "plain latin" text, but it can possibly contain any unicode character. So my python program needs to be able to decode that and handle it. Unfortunately byte_string.decode('unicode') isn't a thing, so I need to specify another encoding scheme. Using Python 3.9.
I've read through the Python docs on unicode and utf-8. If Python uses unicode for its strings, and utf-8 as default, this should be pretty straightforward, yet I keep getting incorrect decodes.
If I understand how unicode works, the most significant byte is the character code, and the least significant byte is the lookup value in the decode table. So I would expect 0x00_41 to decode to "A",
0x00_F2 => ò, and
0x65_03_01 => é (e with combining acute accent).
I wrote a short test file to experiment with these byte combinations, and I'm running into a few situations that I don't understand (despite extensive reading).
Example code:
def main():
    print("Starting MAIN...")

    vrsn_bytes = b'\x76\x72\x73\x6E'
    serato_bytes = b'\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F'
    special_bytes = b'\xB2\xF2'
    combining_bytes = b'\x41\x75\x64\x65\x03\x01'

    print(f"vrsn_bytes: {vrsn_bytes}")
    print(f"serato_bytes: {serato_bytes}")
    print(f"special_bytes: {special_bytes}")
    print(f"combining_bytes: {combining_bytes}")

    encoding_method = 'utf-8'  # also tried latin-1 and cp1252

    vrsn_str = vrsn_bytes.decode(encoding_method)
    serato_str = serato_bytes.decode(encoding_method)
    special_str = special_bytes.decode(encoding_method)
    combining_str = combining_bytes.decode(encoding_method)

    print(f"vrsn_str: {vrsn_str}")
    print(f"serato_str: {serato_str}")
    print(f"special_str: {special_str}")
    print(f"combining_str: {combining_str}")

    return True

if __name__ == '__main__':
    print("Starting Command Line Experiment!")
    if not main():
        print("\n Command Line Test FAILED!!")
    else:
        print("\n Command Line Test PASSED!!")
Issue 1: utf-8 encoding. As the experiment is written, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 0: invalid start byte
I don't understand why this fails to decode; according to the unicode decode table, 0x00B2 should be "SUPERSCRIPT TWO". In fact, it seems like anything above 0x7F returns the same UnicodeDecodeError.
I know that some encoding schemes only support 7 bits, which seems to be what is happening here, but utf-8 should support not only 8 bits but also multiple bytes.
If I change encoding_method to encoding_method = 'latin-1', which extends the original 128 ASCII characters to 256 characters (up to 0xFF), then I get better output:
vrsn_str: vrsn
serato_str: Serato
special_str: ²ò
combining_str: Aude
However, this encoding is not handling the 2-byte codes properly. \x00_53 should be S, not �S, and none of the encoding methods I'll mention in this post handle the combining acute accent after Aude properly.
So far I've tried many different encoding methods, but the ones that come closest are unicode_escape, latin-1, and cp1252. While I expect utf-8 to be what I'm supposed to use, it does not behave as described in the Python doc linked above.
Any help is appreciated. Besides trying more methods, I don't understand why this isn't decoding according to the table in link 3.
UPDATE:
After some more reading, and seeing your responses, I understand why you're so confused. I'm going to explain further so that hopefully this helps someone in the future.
The byte file that I'm decoding is not mine (hence why the encoding does not make sense). What I see now is that the bytes represent the code point, not the byte representation of the unicode character.
For example: I want 0x00_F2 to translate to ò. But the actual byte representation of ò is 0xC3_B2. What I have is the integer representation of the code point. So while I was trying to decode, what I actually need to do is convert 0x00F2 to an integer (242); then I can use chr(242) to convert it to unicode.
I don't know why the file was written this way, and I can't change it. But I see now why the decoding wasn't working. Hopefully this helps someone avoid the frustration I've been figuring out.
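A minimal sketch of that conversion (assuming the file stores two-byte big-endian code point numbers; for BMP characters that is exactly what the utf-16-be codec decodes):

raw = b'\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F'

# Read each 2-byte unit as an integer code point, then map it with chr()
chars = [chr(int.from_bytes(raw[i:i+2], 'big')) for i in range(0, len(raw), 2)]
print(''.join(chars))           # Serato

# Equivalent shortcut for BMP-only data
print(raw.decode('utf-16-be'))  # Serato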
Thanks!
This isn't actually a python issue; it's how you're encoding the character. To convert a unicode codepoint to utf-8, you do not simply take the bytes from the codepoint's numeric position.
For example, the code point U+2192 is →. The actual binary representation in utf-8 is: 0xE28692, or 11100010 10000110 10010010
As we can see, this is 3 bytes, not 2 as we'd expect if we only used the position. To get correct behavior, you can either do the encoding by hand, or use a converter such as this one:
https://onlineunicodetools.com/convert-unicode-to-binary
This will let you input a unicode character and get the utf-8 binary representation.
To get correct output for ò, we need to use 0xC3B2.
>>> s = b'\xC3\xB2'
>>> print(s.decode('utf-8'))
ò
The reason why you can't use the direct binary representation is the header bits in each byte. In utf-8, code points can occupy 1-byte, 2-byte, 3-byte, or 4-byte sequences. For example, to signify a 1-byte codepoint, the first bit is encoded as a 0. This means that we can only store 2^7 one-byte code points. So, the codepoint U+0080, which is a control character, must be encoded as a 2-byte sequence: 11000010 10000000.
For this character, the first byte begins with the header 110, while the second byte begins with the header 10. This means that the data for the codepoint is stored in the last 5 bits of the first byte and the last 6 bits of the second byte. If we combine those, we get
00010 000000, which is equivalent to 0x80.
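You can check that bit arithmetic in Python (a sketch using the masks described above):

b = b'\xc2\x80'                # UTF-8 encoding of U+0080
high = b[0] & 0b00011111       # data bits after the 110 header
low = b[1] & 0b00111111        # data bits after the 10 header
print(hex((high << 6) | low))  # 0x80 -- the original code point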

Python encoding in vim

Trying to understand the encoding/decoding/unicode business in Python 2.7 with vim.
I have a unicode string us to which I assign some unicode string u'é'.
Question 1
How is us represented in memory? Is it a sequence of 32-bit integers holding the Unicode code points (the \u escapes)? Or is it kept in memory as a sequence of 8-bit bytes (the \x values) in some default encoding?
Question 2
I see four different ways to set an encoding for the unicode string us: #1 at the beginning of the test.py file; #2 as an argument of the encode function; #3 as an argument for vim; #4 as the local encoding of the file system. So, what do all these four encodings (#1, #2, #3, #4) do?
$ vim test.py
_____________
#encoding: #1
us=u'é'
print us.encode(encoding='#2')
_____________
:set encoding=#3
$ locale | grep LANG
LANG=en_US.#4
LANGUAGE=
In Python 2.x unicode strings are stored internally as either UCS-2 or UCS-4, depending on the options used when building it. As for the four encodings (a sketch tying them together follows this list):
#1 is the source encoding as far as Python is concerned.
#2 is the encoding used to encode us as bytes when the code is executed.
#3 is the source encoding as far as vim is concerned. If this doesn't match #1, expect trouble.
#4 is the system encoding. It mostly affects filesystem and terminal output operations.
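A minimal Python 2 sketch tying these together (assuming a UTF-8 terminal, so that #1, #3, and #4 all say utf-8):

# -*- coding: utf-8 -*-     # #1: how Python decodes this source file
us = u'é'                   # a unicode object (UCS-2/UCS-4 internally)
print us.encode('utf-8')    # #2: the bytes written to the terminal; the
                            # locale (#4) decides how they are rendered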
Question 1 - Storage
us = u'é'
This creates a Unicode string with the value é - in Python 2.2+ Unicode characters are stored in UCS-2 or UCS-4, which use 2- or 4-byte unsigned integers, depending on a build-time option.
Python 3.3+ uses the flexible string representation from PEP 393, which uses 1, 2, or 4 bytes for each character depending on the range it is in.
The storage of Unicode strings now depends on the highest codepoint in
the string:
pure ASCII and Latin-1 strings (U+0000-U+00FF) use 1 byte per codepoint;
BMP strings (U+0100-U+FFFF) use 2 bytes per codepoint;
strings containing codepoints from the other planes (U+10000-U+10FFFF) use 4 bytes per codepoint.
(The variable-length patterns like 110xxxxx 10xxxxxx belong to UTF-8, the interchange encoding, not to the in-memory representation.)
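You can observe those widths in CPython with sys.getsizeof (an implementation detail; the absolute sizes vary by build, but the per-character increments are the point):

import sys
print(sys.getsizeof('ab') - sys.getsizeof('a'))                       # 1 -- Latin-1 storage
print(sys.getsizeof('\u20ac\u20ac') - sys.getsizeof('\u20ac'))        # 2 -- BMP storage
print(sys.getsizeof('\U0001f600' * 2) - sys.getsizeof('\U0001f600'))  # 4 -- other planes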
Question 2 - Encoding
us=u'é'
Declares us to be a Unicode string, stored as above; note that in Python 3 all strings are Unicode by default, so the u can be omitted.
print(us.encode('ascii', 'strict'))  # encoding='#2'
Tells print how to attempt to translate the Unicode string for output. Note that if you are using Python 3.3+ and a Unicode-capable terminal/console, you probably don't ever need to use this.
:set encoding=#3
Tells vim (and, likewise, emacs and a number of other editors) the encoding to use when displaying and/or editing the file. It applies to all text files, not just python.
$ locale | grep LANG
LANG=en_US.#4
Is an operating system setting for the locale language that tells it how to display various things, specifically which code page to use when displaying extended ASCII.
This doesn't actually answer the question but I'm hoping it gives some more insight into this problem.
Answer to question 1: it shouldn't matter to the programmer how Unicode strings are represented internally in Python.
To question 2:
All the programmer should care about is that the data sink and source encoding requirements are known and correctly specified. I would assume that Python can correctly interpret UTF-encoded files by reading the BOM, and maybe even by making educated guesses, but without the BOM it can be ambiguous how to handle bytes with the high bit set, so it's advisable to either make sure the BOM is there or tell Python that the file is UTF-8 encoded if you're not sure.
There's a difference between "Unicode" and "UTF" that seems to be glossed over above: "UTF" specifies the representation in storage (disk, memory, network packet), while "Unicode" is simply the fact that each character has a single value (code point) ranging from 0 to 0x10FFFF. The various flavors of UTF encode that value into the appropriate storage. Working with encoded strings can be annoying, though (as the character width is variable), so when strings are actually represented in memory it is often easier to expand them into some format that allows easy manipulation. (This is touched on in a comment on another answer.)
If you want a Unicode string in Python pre-3, just type u'<whatever>' and in 3+ type '<whatever>'. You'll get Unicode and you can use \uXXXX and \UXXXXXXXX escapes if it's infeasible to just type the characters in directly. When you want to write the data, specify the encoding. UTF-8 is often the easiest to deal with and seems to be the most commonly used but you may have reason to use a UTF-16 flavor.
The takeaway here is that the encoding is just a way to transform Unicode data so that it can be persisted. The various flavors of UTF are just the encodings, they are not actually Unicode.
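A short Python 3 sketch of those points (escapes on the way in, an explicit encoding on the way out):

s = 'caf\u00e9'               # \u escape; the same string as 'café'
print(s)                      # café
print(s.encode('utf-8'))      # b'caf\xc3\xa9'
print(s.encode('utf-16-le'))  # b'c\x00a\x00f\x00\xe9\x00'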

encoding issue. Replace special character

I have a dictionary that looks like this:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Now I'm trying to replace the \xf6 with ö,
but trying .replace('\xf6', 'ö') returns an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 0: ordinal not in range(128)
How can I fix this?
Now encoding is a minefield, and I might be off on this one - please correct me if that's the case.
From what I've gathered over the years, Python 2 assumes ASCII unless you define an encoding at the top of your script, mainly because either it's compiled that way or the OS/terminal uses ASCII as its primary encoding.
With that said, what you see in your example data:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Is the ASCII representation of a unicode string. Somehow Python needs to tell you there's an ö in there - but it can't with ASCII, because ö has no representation in the ASCII table.
But when you try to replace it using:
x.replace('\xf6', 'ö')
You're trying to find an ASCII character/string called \xf6 that is outside the accepted byte range of ASCII, so that will raise an exception. And you're trying to replace it with another invalid ASCII character, which will cause the same exception.
Hence why you get the "'ascii' codec can't decode byte..." message.
You can do unicode replacements like this:
a = u'Ganztags ge\xf6ffnet'
a.replace(u'\xf6', u'ö')
This will tell Python to find a unicode string, and replace it with another unicode string.
But the output data will result in the same thing in the example above, because \xf6 is ö in unicode.
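You can verify that claim directly (Python 2 sketch; this assumes the source file's coding declaration is correct, so that u'ö' really parses as U+00F6):

a = u'Ganztags ge\xf6ffnet'
print a.replace(u'\xf6', u'ö') == a  # True -- u'\xf6' and u'ö' are the same code point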
What you want to do, is encode your string into something you want to use, for instance - UTF-8:
a.encode('UTF-8')
'Ganztags ge\xc3\xb6ffnet'
And define UTF-8 as your primary encoding by placing this at the top of your code:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
This should in theory make your application a little easier to work with.
And you can from then on work with UTF-8 as your base model.
But there's no way that I know of to convert your representation into an ASCII ö, because there really is no such thing. There are just different ways Python does this encoding magic for you, to make you believe it's possible to "just write ö".
In Python 3 most of the strings you encounter will either be bytes data or treated a bit differently than in Python 2, and for the most part it's a lot easier.
There are numerous ways to change the encoding that are not part of standard practice. But there are ways to do it.
The closest to "good" praxis, would be the locale:
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')
I also had a horrendous solution and approach to this years back, it looked something like this (it was a great bodge for me at the time):
Python - Encoding string - Swedish Letters
tl;dr:
Your code usually assumes/uses ASCII as its encoder/decoder.
ö is not a part of ASCII; therefore you'll always see \xf6 if you've somehow gotten unicode characters. Normally, if you print u'Ganztags ge\xf6ffnet' it will be shown with an ö because of automatic encoding. If you need to verify whether input matches that string, you have to compare unicode to unicode: u'ö' == u'ö'. If other systems depend on this data, encode it with something they understand: .encode('UTF-8'). But replacing \xf6 with ö is the same thing, just that ö doesn't exist in ASCII and you need to write u'ö' - which results in the same data in the end.
As you are working with German text, you should be aware of non-ASCII characters. You know whether your system prefers Latin1 (Windows console and some Unixes), UTF-8 (most Linux variants), or native unicode (Windows GUI).
If you can process everything as native unicode, things are cleaner, and you should just accept the fact that u'ö' and u'\xf6' are the same character - the latter is simply independent of the python source file charset.
If you have to output byte strings or store them in files, you should encode them in UTF-8 (which can represent any unicode character, but characters with code points above 127 use more than one byte) or Latin1 (one byte per character, but it only supports unicode code points below 256).
In that case just use an explicit encoding to convert your unicode strings to byte strings:
print u'Ganztags ge\xf6ffnet'.encode('Latin1') # or .encode('utf8')
should give what you expect.

python u'\u00b0' returns u'\xb0'. Why?

I use python 2.7.10.
On dealing with character encoding, and after reading a lot of Stack Overflow etc. on the subject, I encountered this behaviour, which looks strange to me. Python interpreter input
>>>u'\u00b0'
results in the following output:
u'\xb0'
I could repeat this behaviour using a dos window, the idle console, and the wing-ide python shell.
My assumptions (correct me if I am wrong):
The "degree symbol" has unicode 0x00b0, utf-8 code 0xc2b0, latin-1 code 0xb0.
Python doc say, a string literal with u-prefix is encoded using unicode.
Question: Why is the result converted to a unicode-string-literal with a byte-escape-sequence which matches the latin-1 encoding, instead of persisting the unicode escape sequence ?
Thanks in advance for any help.
Python uses some rules to determine what repr outputs for each character. The rule for Unicode code points in the range 0x0080 to 0x00FF is to use the sequence \xdd, where dd is the hex code, at least in Python 2. There's no way to change it. In Python 3, all printable characters are displayed without converting to a hex escape.
As for why it looks like Latin-1 encoding: it's because Unicode started with Latin-1 as its base. All the code points up to 0xFF match their Latin-1 counterparts.
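You can see both rules side by side in Python 3, where ascii() reproduces the old Python 2 repr behaviour:

print(repr('\u00b0'))      # '°' -- Python 3 repr shows printable characters directly
print(ascii('\u00b0'))     # '\xb0' -- the escaped form, matching Python 2's repr
print('\u00b0' == '\xb0')  # True -- two spellings of the same code point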

Python Unicode Bug

I'm making a virtual machine in RPython using PyPy. The problem is, when I tried to add unicode support I found an unusual problem. I'll use the letter "á" in my examples.
# The char in the example is á
print len(char)
OUTPUT:
2
I understand how the letter "á" takes two bytes, hence the length of 2. But the problem shows up when I use the example below.
# In this example instr = "á" (including the quotes)
for char in instr:
    print hex(int(ord(char)))
OUTPUT:
0x22
0xc3
0xa1
0x22
As you can see, there are 4 numbers. The 0x22s are for the quotes, but there is only one letter between the quotes, yet two numbers for it. My question is: some machines I tested this script on produced this output:
OUTPUT:
0x22
0xe1
0x22
Is there any way to make the output the same on both machines? The script is exactly the same on each.
The program is not being given the same input on the two machines:
In [154]: '\xe1'.decode('cp1252').encode('utf_8') == '\xc3\xa1'
Out[154]: True
When you type á in a console, you may see the glyph á, but the console is translating that into bytes. The particular bytes it translates that into depends on the encoding used by the console. On a Windows machine, that may be cp1252, while on a Unix machine it is likely to be utf-8.
So you may see the input as the same, but the console (and thus the program) receives different input.
If your program were to decode the bytes with the appropriate encoding, and then work with unicode, then both programs will operate the same after that point. If you are receiving the bytes from sys.stdin, then sys.stdin.encoding will be the encoding Python detects the console is using.
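A Python 2 sketch of that normalization step (sys.stdin.encoding can be None when input is piped, so the utf-8 fallback here is an assumption):

import sys

raw = sys.stdin.readline().rstrip('\n')           # bytes as delivered by the console
text = raw.decode(sys.stdin.encoding or 'utf-8')  # unicode, whatever the console used
print repr(text)                                  # u'\xe1' for á on any console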
You have this question tagged "Python-3.x" -- is it possible that some machines are running Python 2.x, and others are running Python 3.x?
The character á is in fact U+00E1, so on a Python 3.x system, I would expect to see your second output. Since strings are Unicode in Python 3 by default, len(char) will be 3 (including the quotes).
In Python 2.x, that same character in a string will be two bytes long, and (depending on your input method) will be represented in UTF-8 as \xc3\xa1. On that system, len(char) will be 4, and you would see your first output.
The issue is that you are using bytestrings to work with text data. You should use Unicode instead.
It implies that you need to know the character encoding of your input data -- There Ain't No Such Thing As Plain Text.
If you know the character encoding then it is easy to convert a bytestring to Unicode e.g.:
unicode_text = bytestring.decode(encoding)
It should resolve your initial issue.
There are also Unicode normalization forms e.g.:
import unicodedata
norm_text = unicodedata.normalize('NFC', unicode_text)
If I don't change the encoding in the program, how can I output unicode characters, for example?
You might mean that you have a sequence of bytes, e.g., '\xc3\xa1' (two bytes), that can be interpreted as text using some character encoding; e.g., it is the U+00E1 Unicode code point in utf-8. It may be something different in a different character encoding. Please read the link I've provided above: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Unless by accident your terminal uses the same character encoding as the data in your input file, you need to be able to convert from one character encoding to another. Otherwise the output will be corrupted: e.g., instead of á you might get ├б on the screen.
In ordinary Python, you could use the bytes.decode and unicode.encode methods (or the codecs module directly). I don't know whether that is possible in RPython.
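For instance, in ordinary Python (a Python 2 sketch; knowing that the input bytes are UTF-8 is the assumption you must supply):

data = '\xc3\xa1'             # byte string: the UTF-8 encoding of á
text = data.decode('utf-8')   # -> u'\xe1' (U+00E1)
out = text.encode('latin-1')  # re-encode for, say, a Latin-1 terminal
print repr(text), repr(out)   # u'\xe1' '\xe1'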
