Character encoding with Python 3

If I run
print(chr(244).encode())
I get the two-byte result b'\xc3\xb4'. Why is that? I imagine the number 244 can be encoded into one byte!

Your default locale appears to use UTF-8 as the output encoding.
Any codepoint outside the range 0-127 is encoded with multiple bytes in the variable-width UTF-8 codec.
You'll have to use a different codec to encode that codepoint to one byte. The Latin-1 encoding can manage it just fine, and the EBCDIC 500 codec (codepage 500) can too, though it encodes to a different byte:
>>> print(chr(244).encode('utf8'))
b'\xc3\xb4'
>>> print(chr(244).encode('latin1'))
b'\xf4'
>>> print(chr(244).encode('cp500'))
b'\xcb'
But the Latin-1 and EBCDIC 500 codecs can each only encode 256 codepoints; UTF-8 can manage all of the Unicode standard.
If you were expecting the number 244 to be interpreted as a byte value instead, then you should not use chr().encode(); chr() produces a unicode value, not a 'byte', and encoding then produces a different result depending on the exact codec. That's because unicode values are text, not bytes.
Pass your number as a list of integers to the bytes() callable instead:
>>> bytes([244])
b'\xf4'
This only happens to fit the Latin-1 codec result, because the first 256 Unicode codepoints map directly to Latin 1 bytes, by design.
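That correspondence can be checked directly; a minimal sketch:

```python
# The first 256 Unicode code points coincide with Latin-1 by design, so
# building the byte directly and encoding via the latin1 codec agree.
n = 244
assert bytes([n]) == chr(n).encode('latin1') == b'\xf4'

# Decoding that single byte with latin1 recovers the original character.
assert bytes([n]).decode('latin1') == chr(n)
```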

Character #244 is U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX which is indeed encoded as 0xc3 0xb4 in UTF-8. If you want to use a single-byte encoding then you need to specify it.

I imagine the number 244 can be encoded into one byte!
Sure, if you design an encoding that can only handle 256 code points, all of them can be encoded into one byte.
But if you design an encoding that can handle all of Unicode's 1,114,112 code points, obviously you can't pack all of them into one byte.
If your only goal were to make things as compact as possible, you could use most of the 256 initial byte values for common code points, and only reserve a few as start bytes for less common code points.
However, if you only use the lower 128 for single-byte values, there are some big advantages, especially if you design it so that every byte is unambiguously either a 7-bit character, a start byte, or a continuation byte. That makes the algorithm a lot simpler to implement and faster; you can always scan forward or backward to the start of a character; you can search for ASCII text in a string with traditional byte-oriented (strchr) searches; a simple heuristic can detect the encoding very reliably; you can always detect a truncated string start/end instead of misinterpreting it; and so on. So, that's exactly what UTF-8 does.
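As a small illustration of that self-synchronizing property, here is a hypothetical helper (the name char_start is mine, not from any library) that scans backward to the lead byte of the character containing a given index:

```python
def char_start(data: bytes, i: int) -> int:
    """Scan backward from index i to the start of the UTF-8 character
    containing it; continuation bytes always look like 0b10xxxxxx."""
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

s = 'aé☂'.encode('utf-8')      # bytes: 61 C3 A9 E2 98 82
assert char_start(s, 2) == 1   # middle of 'é' -> its lead byte at index 1
assert char_start(s, 5) == 3   # last byte of '☂' -> its lead byte at index 3
```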
Wikipedia explains UTF-8 pretty well. Rob Pike, one of the inventors of UTF-8, explains the design history in detail.

Related

Python utf-8 encoding not following unicode rules

Background: I've got a byte file that is encoded using unicode. However, I can't figure out the right method to get Python to decode it to a string. Sometimes it uses 1-byte ASCII text. The majority of the time it uses 2-byte "plain latin" text, but it can possibly contain any Unicode character, so my Python program needs to be able to decode that and handle it. Unfortunately byte_string.decode('unicode') isn't a thing, so I need to specify another encoding scheme. Using Python 3.9.
I've read through the Python doc on unicode and the Python doc on utf-8. If Python uses Unicode for its strings, and utf-8 as the default, this should be pretty straightforward, yet I keep getting incorrect decodes.
If I understand how unicode works, the most significant byte is the character code, and the least significant byte is the lookup value in the decode table. So I would expect 0x00_41 to decode to "A",
0x00_F2 => ò, and
0x65_03_01 => é (e with a combining acute accent).
I wrote a short test file to experiment with these byte combinations, and I'm running into a few situations that I don't understand (despite extensive reading).
Example code:
def main():
    print("Starting MAIN...")
    vrsn_bytes = b'\x76\x72\x73\x6E'
    serato_bytes = b'\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F'
    special_bytes = b'\xB2\xF2'
    combining_bytes = b'\x41\x75\x64\x65\x03\x01'

    print(f"vrsn_bytes: {vrsn_bytes}")
    print(f"serato_bytes: {serato_bytes}")
    print(f"special_bytes: {special_bytes}")
    print(f"combining_bytes: {combining_bytes}")

    encoding_method = 'utf-8'  # also tried latin-1 and cp1252
    vrsn_str = vrsn_bytes.decode(encoding_method)
    serato_str = serato_bytes.decode(encoding_method)
    special_str = special_bytes.decode(encoding_method)
    combining_str = combining_bytes.decode(encoding_method)

    print(f"vrsn_str: {vrsn_str}")
    print(f"serato_str: {serato_str}")
    print(f"special_str: {special_str}")
    print(f"combining_str: {combining_str}")
    return True

if __name__ == '__main__':
    print("Starting Command Line Experiment!")
    if not main():
        print("\n Command Line Test FAILED!!")
    else:
        print("\n Command Line Test PASSED!!")
Issue 1: utf-8 encoding. As the experiment is written, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 0: invalid start byte
I don't understand why this fails to decode; according to the Unicode table, 0x00B2 should be "SUPERSCRIPT TWO". In fact, it seems like anything above 0x7F returns the same UnicodeDecodeError.
I know that some encoding schemes only support 7 bits, which seems to be what is happening here, but utf-8 should support not only 8 bits but multiple bytes.
If I change encoding_method to encoding_method = 'latin-1', which extends the original 128 ASCII characters to 256 characters (up to 0xFF), then I get better output:
vrsn_str: vrsn
serato_str: Serato
special_str: ²ò
combining_str: Aude
However, this encoding is not handling the 2-byte codes properly. \x00_53 should be S, not �S, and none of the encoding methods I'll mention in this post handle the combining acute accent after Aude properly.
So far I've tried many different encoding methods, but the ones that come closest are unicode_escape, latin-1, and cp1252. While I expect utf-8 to be what I'm supposed to use, it does not behave as described in the Python doc linked above.
Any help is appreciated. Besides trying more methods, I don't understand why this isn't decoding according to the table in link 3.
UPDATE:
After some more reading, and seeing your responses, I understand why you're so confused. I'm going to explain further so that hopefully this helps someone in the future.
The byte file that I'm decoding is not mine (hence why the encoding does not make sense). What I see now is that the bytes represent the code point, not the byte representation of the unicode character.
For example: I want 0x00_F2 to translate to ò. But the actual UTF-8 byte representation of ò is 0xC3_B2. What I have is the integer value of the code point. So while I was trying to decode, what I actually need to do is convert 0x00F2 to an integer (242), then use chr(242) to convert it to the character.
I don't know why the file was written this way, and I can't change it. But I see now why the decoding wasn't working. Hopefully this helps someone avoid the frustration I've been through.
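For what it's worth, two-byte records that hold the raw code-point number big-endian are, for characters in the Basic Multilingual Plane, exactly what the utf-16-be codec expects; a sketch under that assumption (a file mixing one-byte and two-byte records would still need its own framing rules to split them apart):

```python
# The "Serato" bytes from the question are code points stored two bytes
# at a time, big-endian, which for BMP characters is plain UTF-16-BE.
serato_bytes = b'\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F'
assert serato_bytes.decode('utf-16-be') == 'Serato'

# Equivalently: convert each pair to an integer and feed it to chr().
pairs = [serato_bytes[i:i + 2] for i in range(0, len(serato_bytes), 2)]
assert ''.join(chr(int.from_bytes(p, 'big')) for p in pairs) == 'Serato'

# The single record 0x00F2 from the update works the same way.
assert chr(int.from_bytes(b'\x00\xf2', 'big')) == 'ò'
```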
Thanks!
This isn't actually a python issue, it's how you're encoding the character. To convert a unicode codepoint to utf-8, you do not simply get the bytes from the codepoint position.
For example, the code point U+2192 is →. The actual binary representation in utf-8 is: 0xE28692, or 11100010 10000110 10010010
As we can see, this is 3 bytes, not 2 as we'd expect if we only used the position. To get correct behavior, you can either do the encoding by hand, or use a converter such as this one:
https://onlineunicodetools.com/convert-unicode-to-binary
This will let you input a unicode character and get the utf-8 binary representation.
To get correct output for ò, we need to use 0xC3B2.
>>> s = b'\xC3\xB2'
>>> print(s.decode('utf-8'))
ò
The reason why you can't use the direct binary representation is because of the header bits in each byte. In utf-8, code points are encoded with 1, 2, 3, or 4 bytes. For example, to signify a 1-byte code point, the first bit is encoded as a 0. This means that we can only store 2^7 code points in one byte. So the code point U+0080, which is a control character, must be encoded as a 2-byte sequence: 11000010 10000000.
For this character, the first byte begins with the header 110, while the second byte begins with the header 10. This means that the data for the codepoint is stored in the last 5 bits of the first byte and the last 6 bits of the second byte. If we combine those, we get
00010 000000, which is equivalent to 0x80.
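The bit surgery described above can be sketched in a few lines (decode_two_byte is a hypothetical helper name, not a library function):

```python
def decode_two_byte(b1: int, b2: int) -> int:
    """Combine a 110xxxxx lead byte and a 10xxxxxx continuation byte
    into the code point they encode: 5 data bits + 6 data bits."""
    assert b1 >> 5 == 0b110 and b2 >> 6 == 0b10
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

assert decode_two_byte(0xC2, 0x80) == 0x80       # U+0080, as above
assert decode_two_byte(0xC3, 0xB2) == ord('ò')   # U+00F2
```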

encoding string in python [duplicate]

I have this issue and I can't figure out how to solve it. I have this string:
data = '\xc4\xb7\x86\x17\xcd'
When I tried to encode it:
data.encode()
I get this result:
b'\xc3\x84\xc2\xb7\xc2\x86\x17\xc3\x8d'
I only want:
b'\xc4\xb7\x86\x17\xcd'
Does anyone know the reason and how to fix this? The string is already stored in a variable, so I can't add the literal b in front of it.
You cannot convert a string into bytes or bytes into a string without taking an encoding into account. The whole point of the bytes type is to be an encoding-independent sequence of bytes, while str is a sequence of Unicode code points, which by design have no unique byte representation.
So when you want to convert one into the other, you must tell explicitly what encoding you want to use to perform this conversion. When converting into bytes, you have to say how to represent each character as a byte sequence; and when you convert from bytes, you have to say what method to use to map those bytes into characters.
If you don’t specify the encoding, then UTF-8 is the default, which is a sane default since UTF-8 is ubiquitous, but it's also just one of many valid encodings.
If you take your original string, '\xc4\xb7\x86\x17\xcd', take a look at what Unicode code points these characters represent. \xc4 for example is the LATIN CAPITAL LETTER A WITH DIAERESIS, i.e. Ä. That character happens to be encoded in UTF-8 as 0xC3 0x84 which explains why that’s what you get when you encode it into bytes. But it also has an encoding of 0x00C4 in UTF-16 for example.
As for how to solve this properly so you get the desired output, there is no clear correct answer. The solution that Kasramvd mentioned is also somewhat imperfect. If you read about the raw_unicode_escape codec in the documentation:
raw_unicode_escape
Latin-1 encoding with \uXXXX and \UXXXXXXXX for other code points. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol.
So this is just a Latin-1 encoding which has a built-in fallback for characters outside of it. I would consider this fallback somewhat harmful for your purpose. For Unicode characters that cannot be represented as a \xXX sequence, this might be problematic:
>>> chr(256).encode('raw_unicode_escape')
b'\\u0100'
So the code point 256 is explicitly outside of Latin-1 which causes the raw_unicode_escape encoding to instead return the encoded bytes for the string '\\u0100', turning that one character into 6 bytes which have little to do with the original character (since it’s an escape sequence).
So if you wanted to use Latin-1 here, I would suggest you use that codec explicitly, without the escape-sequence fallback from raw_unicode_escape. This will simply cause an exception when trying to convert code points outside of the Latin-1 area:
>>> '\xc4\xb7\x86\x17\xcd'.encode('latin1')
b'\xc4\xb7\x86\x17\xcd'
>>> chr(256).encode('latin1')
Traceback (most recent call last):
File "<pyshell#28>", line 1, in <module>
chr(256).encode('latin1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)
Of course, whether or not code points outside of the Latin-1 area can cause problems for you depends on where that string actually comes from. But if you can guarantee that the input will only contain valid Latin-1 characters, then chances are that you don't really need to be working with a string there in the first place. Since you are actually dealing with some kind of bytes, you should check whether you can simply retrieve those values as bytes in the first place. That way you won't introduce two levels of encoding where you can corrupt data by misinterpreting the input.
You can use 'raw_unicode_escape' as your encoding:
In [14]: bytes(data, 'raw_unicode_escape')
Out[14]: b'\xc4\xb7\x86\x17\xcd'
As mentioned in comments you can also pass the encoding directly to the encode method of your string.
In [15]: data.encode("raw_unicode_escape")
Out[15]: b'\xc4\xb7\x86\x17\xcd'

Why can't I decode any byte using utf-8?

Not an expert in encoding, trying to learn.
I got a file in latin encoding, when trying to read it and decode using 'utf-8' I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte
Why can't utf-8 (which uses 1-4 bytes per character) decode latin1 (1 byte per character)?
Is latin1 a subset of utf-8? I am pretty lost regarding this, so anything helps.
There are two different concepts being confused.
A character set is an ordering of characters, numbered from 0 to ...
An encoding is a way of representing the numbers in a character set as a sequence of bytes.
For a character set with at most 256 characters, a trivial single-byte encoding is possible.
For larger character sets, multi-byte encodings are required. They can be split into two types: fixed size, where every character uses the same number of bytes, and variable size, where different numbers of bytes are used to represent different characters.
Examples of fixed-size encodings are all single-byte encodings and UTF-32.
Examples of variable-sized encodings are UTF-8 and the various UTF-16 encodings.
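A quick way to see the fixed-vs-variable difference is to compare the encoded lengths of the same string under different codecs:

```python
s = 'Aé☂'  # U+0041 (ASCII), U+00E9, U+2602

# UTF-32 is fixed-size: always 4 bytes per code point.
assert len(s.encode('utf-32-be')) == 12

# UTF-8 is variable-size: these three characters take 1, 2 and 3 bytes.
assert len(s.encode('utf-8')) == 6

# UTF-16 is also variable-size, but all three here fit in one 2-byte word.
assert len(s.encode('utf-16-be')) == 6
```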
Latin-1 is both a character set (containing the ASCII characters plus additional characters for writing Western European languages) and, having exactly 256 characters, its own corresponding single-byte encoding.
Unicode is a character set containing (or aiming to contain) all characters to write all known languages. It is, not surprisingly, much larger than 256 characters.
UTF-8 is just one multibyte encoding of Unicode, and a variable-sized one. The first byte of each UTF-8 sequence tells you how many additional bytes follow it to encode a single Unicode code point.
Unicode and Latin-1 (the character set) coincide for their first 256 code points. That is, Latin-1 is a subset of Unicode.
UTF-8 and Latin-1 coincide for their first 128 sequences. After that, they diverge. In UTF-8, code points 128 through 255 require two bytes. The first byte has the form 110xxxxx, the second 10xxxxxx. The 11 x bits are free for encoding the rest of the Latin-1 subset (plus additional blocks) of Unicode.
The byte 0xa0 is not valid UTF-8, as its binary expansion is 10100000. The first byte of a multibyte UTF-8 sequence always starts with at least two 1s (intuitively, the number of 1s indicates the number of bytes in the sequence).
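A tiny sketch of that classification (the classify helper is illustrative, not a library function):

```python
def classify(b: int) -> str:
    """Rough UTF-8 role of a single byte, judged from its leading bits."""
    if b < 0x80:
        return 'ascii'           # 0xxxxxxx: a complete 1-byte character
    if b >> 6 == 0b10:
        return 'continuation'    # 10xxxxxx: can never start a character
    return 'lead'                # 11xxxxxx: starts a multibyte sequence

assert classify(0xA0) == 'continuation'  # hence "invalid start byte"
assert classify(0xC3) == 'lead'
assert classify(0x41) == 'ascii'
```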

How does decoding in UTF-8 know the byte boundaries?

I've been doing a bunch of reading on unicode encodings, especially with regards to Python. I think I have a pretty strong understanding of it now, but there's still one small detail I'm a little unsure about.
How does the decoding know the byte boundaries? For example, say I have a unicode string with two unicode characters with byte representations of \xc6\xb4 and \xe2\x98\x82, respectively. I then write this unicode string to a file, so the file now contains the bytes
\xc6\xb4\xe2\x98\x82. Now I decide to open and read the file (and Python defaults to decoding the file as utf-8), which leads me to my main question.
How does the decoding know to interpret the bytes \xc6\xb4 and not \xc6\xb4\xe2?
The byte boundaries are easily determined from the bit patterns. In your case, \xc6 starts with the bits 1100, and \xe2 starts with 1110. In UTF-8 (and I'm pretty sure this is not an accident), you can determine the number of bytes in the whole character by looking only at the first byte and counting the number of 1 bits at the start before the first 0. So your first character has 2 bytes and the second one has 3 bytes.
If a byte starts with 0, it is a regular ASCII character.
If a byte starts with 10, it is part of a UTF-8 sequence (not the first character).
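Those rules can be sketched as a small helper that derives a sequence's length from its lead byte by counting the leading 1 bits (seq_len is a made-up name):

```python
def seq_len(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with this byte,
    found by counting the leading 1 bits of the lead byte."""
    if lead < 0x80:
        return 1          # 0xxxxxxx: plain ASCII, one byte
    n = 0
    while lead & 0x80:
        n += 1
        lead = (lead << 1) & 0xFF
    return n

assert seq_len(0xC6) == 2   # \xc6\xb4 from the question
assert seq_len(0xE2) == 3   # \xe2\x98\x82 from the question
```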

Is the Unicode code point value equal to the UTF-16BE representation for every character?

I saved some strings in Microsoft Agenda in Unicode big endian format (UTF-16BE). When I open the file with the shell command xxd to see the binary values, write them down, then get the code point of each character with Python's built-in ord() (which takes a one-character Unicode string and returns the code point value), and compare the two, I find they are equal.
But I think the Unicode code point value is different from the UTF-16BE representation: one is a code point; the other is an encoding format. Some of them are equal, but maybe they differ for some characters.
Is the Unicode code point value equal to the UTF-16BE encoding representation for every character?
No, codepoints outside of the Basic Multilingual Plane use two UTF-16 words (so 4 bytes).
For codepoints in the U+0000 to U+D7FF and U+E000 to U+FFFF ranges, the codepoint and UTF-16 encoding map one-to-one.
For codepoints in the range U+10000 to U+10FFFF, two words in the range U+D800 to U+DFFF are used; a lead surrogate from 0xD800 to 0xDBFF and a trail surrogate from 0xDC00 to 0xDFFF.
See the UTF-16 Wikipedia article on the nitty gritty details.
So, most UTF-16 big-endian words map directly to Unicode codepoints when printed. For UTF-16 little-endian you just swap the bytes around. For UTF-16 words whose first byte is 0xD8 through 0xDF, you'll have to map the surrogate pair to the actual codepoint.
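The surrogate arithmetic can be checked directly:

```python
cp = 0x1F600  # 😀, outside the Basic Multilingual Plane

# Manual surrogate computation: subtract 0x10000, then split the
# remaining 20 bits into a 10-bit lead and a 10-bit trail surrogate.
v = cp - 0x10000
lead = 0xD800 + (v >> 10)
trail = 0xDC00 + (v & 0x3FF)
assert (lead, trail) == (0xD83D, 0xDE00)

# The utf-16-be codec produces exactly those two words (4 bytes).
assert chr(cp).encode('utf-16-be') == b'\xd8\x3d\xde\x00'
```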
