Not an expert in encoding, trying to learn.
I got a file in latin encoding, when trying to read it and decode using 'utf-8' I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte
Why can't utf-8 (which uses 1-4 bytes per character) decode latin1 (1 byte per character)?
Is latin1 a subset of utf-8? I am pretty lost regarding this so anything helps.
There are two different concepts being confused.
A character set is an ordering of characters, numbered from 0 to ...
An encoding is a way of representing the numbers in a character set as a sequence of bytes.
For a character set with at most 256 characters, a trivial single-byte encoding is possible.
For larger character sets, multi-byte encodings are required. They can be split into two types: fixed size, where every character uses the same number of bytes, and variable size, where different numbers of bytes are used to represent different characters.
Examples of fixed-size encodings are all single-byte encodings and UTF-32.
Examples of variable-sized encodings are UTF-8 and the various UTF-16 encodings.
Latin-1 is both a character set (containing the ASCII characters plus additional characters for writing Western European languages) and, since it has exactly 256 characters, its own corresponding single-byte encoding.
Unicode is a character set containing (or aiming to contain) all characters to write all known languages. It is, not surprisingly, much larger than 256 characters.
UTF-8 is just one multibyte encoding of Unicode, and a variable-sized one. The first byte of each UTF-8 sequence tells you how many additional bytes follow it to encode a single Unicode code point.
Unicode and Latin-1 (the character set) coincide for their first 256 code points. That is, Latin-1 is a subset of Unicode.
UTF-8 and Latin-1 coincide for the first 128 characters, which are encoded as single bytes. After that, they diverge. In UTF-8, code points 128 through 255 require two bytes. The first byte has the form 110xxxxx, the second 10xxxxxx. The 11 x bits are free for encoding the rest of the Latin-1 subset (plus additional blocks) of Unicode.
The byte 0xa0 is not valid UTF-8, as its binary expansion is 10100000. The first byte of a multibyte UTF-8 sequence always starts with at least two 1s (intuitively, the number of 1s indicates the number of bytes in the sequence).
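To make that concrete, here is a quick Python sketch (the byte 0xa0 from your error happens to be NO-BREAK SPACE in Latin-1):
>>> b'\xa0'.decode('latin-1')    # every byte value is a valid Latin-1 character
'\xa0'
>>> '\xa0'.encode('utf-8')       # the same character needs two bytes in UTF-8
b'\xc2\xa0'
>>> b'\xa0'.decode('utf-8')      # raises UnicodeDecodeError: invalid start byte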
Related
How can I write all the characters of Unicode or UTF-8 to a file, one by one, without any space or break between them?
file:
0123456789!"#$%&'()ABCDEFGHIJKLMNOPQRSTUVWXYZ
...And so on from 0-100000
z=""
for y in range(0, 100000):
z+=chr(y)
open("./aa", "w").write(z)
# UnicodeEncodeError: 'utf-8' codec can't encode characters in position 55296-57343: surrogates not allowed
for z in range(0, 100000):
    print(chr(z))
# UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
Unicode is HARD! Ugh! I'm serious, I feel your pain. Every time I think I understand it, there's more... Ugh!
Anyway, I think the problem you're hitting is described in the following couple links (link1 and link2). See the colorful picture on link2.
The issue is that the basic multilingual plane (BMP), which contains code-points U+0000 through U+FFFF (0 through 65535, i.e. 2^16 code points), only contains 55503 assigned characters. What gives?
Well, some code-points are reserved for "private use" and some are UTF-16 surrogates. The surrogates aren't characters at all: they're code points that UTF-16 uses in pairs to represent code points outside the BMP, so they make no sense as individual characters, and UTF-8 refuses to encode them on their own. In the image, the surrogates start right at your magic number of 55296 (or U+D800).
I ran your code while using utf16 and got the same error, which makes sense: lone surrogates aren't encodable in any of the UTF encodings. For now, those regions of code-points are not valid characters on their own, so you may want to just skip them.
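For example, a minimal sketch of the same loop that skips the surrogate block (U+D800 through U+DFFF; the filename is just the one from your snippet):
with open("./aa", "w", encoding="utf-8") as f:
    for y in range(0, 100000):
        if 0xD800 <= y <= 0xDFFF:
            continue  # surrogate code points cannot be written as UTF-8
        f.write(chr(y))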
In Python (either 2 or 3), evaluating b'\xe2\x80\x8f'.decode("utf-8")
yields \u200f, and similarly '\u200f'.encode("utf-8") yields b'\xe2\x80\x8f'.
The first looks like a chain of three 2-character hex values that would equal decimal 226, 128, and 143. The second looks like a single hex value that would equal decimal 8,207.
Is there a logical relationship between '\xe2\x80\x8f' and '\u200f' ? Am I interpreting the values incorrectly?
I can see the values are linked somehow in tables like this one: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string-literal
but why are these two values on the same row?
The difference is related to the number of bytes each character takes up when represented in UTF-8.
For any character equal to or below 127 (hex 0x7F), the UTF-8 representation is one byte. It is just the lowest 7 bits of the full Unicode value. This is also the same as the ASCII value.
For characters equal to or below 2047 (hex 0x07FF), the UTF-8 representation is spread across two bytes. The first byte will have the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The second byte will have the top bit set and the second bit clear (i.e. 0x80 to 0xBF).
For characters equal to or below 65535 (hex 0xFFFF), which includes your U+200F, three bytes are used: a lead byte in the range 0xE0 to 0xEF followed by two continuation bytes in the range 0x80 to 0xBF. That is why '\u200f' encodes to the three bytes 0xE2 0x80 0x8F.
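For example, here is a quick sketch in Python that pulls the payload bits of the three-byte sequence back out by hand to recover the code point:
>>> b1, b2, b3 = b'\xe2\x80\x8f'
>>> hex(((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F))
'0x200f'
>>> '\u200f'.encode('utf-8')
b'\xe2\x80\x8f'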
There is more information about this here.
If you wanted more info on how Python uses these values, check out here.
Yes, the first is "a chain of three 2-character hex values that would equal decimal 226, 128, and 143." It's a byte string. You got a byte string because that's what encode does. You passed it UTF-8 so the bytes are the UTF-8 encoding for the input character string.
"The second looks like a single hex value that would equal decimal 8,207." Sort of; It's the notation for a UTF-16 code unit inside a literal character string. One or two UTF-16 code units encode a Unicode codepoint. In this case, only one is used for the corresponding codepoint.
Sure, you can convert the hex to decimal but that's not very common or useful in either case. A code unit is a specific bit pattern. Bytes are that bit pattern as an integer, serialized to a byte sequence.
The Unicode codepoint range needs 21 bits. UTF-16 encodes a codepoint in one or two 16-bit code units (so that's two bytes in some byte order for each code unit). UTF-8 encodes a codepoint in one, two, three or four 8-bit code units. (An 8-bit integer is one byte, so byte order is moot.) Each character encoding has a separate algorithm to distribute the 21 bits into as many bytes as are needed. Both are reversible and fully support the Unicode character set, so you could directly convert one to the other.
The table you reference doesn't show UTF-16. It shows Unicode codepoint hex notation: U+200F. That notation is for humans to identify codepoints. It so happens that when UTF-16 encodes a codepoint in one code unit, its number is the same as the codepoint's number.
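You can see that coincidence directly in Python (a small sketch):
>>> '\u200f'.encode('utf-16-be').hex()   # the single UTF-16 code unit, big-endian
'200f'
>>> hex(ord('\u200f'))                   # the codepoint number is the same
'0x200f'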
I've been doing a bunch of reading on unicode encodings, especially with regards to Python. I think I have a pretty strong understanding of it now, but there's still one small detail I'm a little unsure about.
How does the decoding know the byte boundaries? For example, say I have a unicode string with two unicode characters with byte representations of \xc6\xb4 and \xe2\x98\x82, respectively. I then write this unicode string to a file, so the file now contains the bytes
\xc6\xb4\xe2\x98\x82. Now I decide to open and read the file (and Python defaults to decoding the file as utf-8), which leads me to my main question.
How does the decoding know to interpret the bytes \xc6\xb4 and not \xc6\xb4\xe2?
The byte boundaries are easily determined from the bit patterns. In your case, \xc6 starts with the bits 1100, and \xe2 starts with 1110. In UTF-8 (and I'm pretty sure this is not an accident), you can determine the number of bytes in the whole character by looking only at the first byte and counting the number of 1 bits at the start before the first 0. So your first character has 2 bytes and the second one has 3 bytes.
If a byte starts with 0, it is a regular one-byte ASCII character.
If a byte starts with 10, it is a continuation byte: part of a multi-byte sequence, but never the first byte of one.
If a byte starts with 11, it is the lead byte of a multi-byte sequence, and the number of leading 1s tells you how many bytes the sequence has in total.
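A minimal sketch of that rule in Python, splitting your exact byte stream back into characters by inspecting only the lead bytes (seq_len is just an illustrative helper, not a library function):
def seq_len(lead):
    # count the leading 1 bits: 0 -> one-byte ASCII, 2 -> two bytes, 3 -> three bytes, 4 -> four bytes
    if lead < 0x80:
        return 1
    n = 0
    while lead & (0x80 >> n):
        n += 1
    return n

data = b'\xc6\xb4\xe2\x98\x82'
i = 0
while i < len(data):
    n = seq_len(data[i])
    print(data[i:i+n], '->', data[i:i+n].decode('utf-8'))
    i += n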
I saved some strings in Microsoft Agenda in Unicode big endian format (UTF-16BE). When I dump the file with the shell command xxd to see the raw byte values, and separately get the code point of each character with ord() (the Python built-in function which takes a one-character Unicode string and returns its code point value), and compare the two, I find they are equal.
But I think that the Unicode code point value is different from UTF-16BE: one is a code point; the other is an encoding format. Some of them are equal, but maybe they are different for some characters.
Is the Unicode code point value equal to the UTF-16BE encoding representation for every character?
No, codepoints outside of the Basic Multilingual Plane use two UTF-16 words (so 4 bytes).
For codepoints in the U+0000 to U+D7FF and U+E000 to U+FFFF ranges, the codepoint and UTF-16 encoding map one-to-one.
For codepoints in the range U+10000 to U+10FFFF, two words in the range U+D800 to U+DFFF are used; a lead surrogate from 0xD800 to 0xDBFF and a trail surrogate from 0xDC00 to 0xDFFF.
See the UTF-16 Wikipedia article for the nitty-gritty details.
So, most UTF-16 big-endian code units, when printed, can be mapped directly to Unicode codepoints. For UTF-16 little-endian you just swap the bytes around. For UTF-16 words starting with a 0xD8 through 0xDF byte, you'll have to map the surrogate pair to the actual codepoint.
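A quick illustration in Python (the emoji is just an arbitrary codepoint outside the BMP):
>>> 'A'.encode('utf-16-be').hex()           # BMP: the code unit equals the codepoint (U+0041)
'0041'
>>> '\U0001f600'.encode('utf-16-be').hex()  # outside the BMP: a surrogate pair 0xD83D 0xDE00
'd83dde00'
>>> hex(ord('\U0001f600'))
'0x1f600'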
If I run
print(chr(244).encode())
I get the two-byte result b'\xc3\xb4'. Why is that? I imagine the number 244 can be encoded into one byte!
Python's str.encode() uses UTF-8 as the default codec when you don't name one.
Any codepoint outside the range 0-127 is encoded with multiple bytes in the variable-width UTF-8 codec.
You'll have to use a different codec to encode that codepoint to one byte. The Latin-1 encoding can manage it just fine, while the EBCDIC 500 codec (codepage 500) can too, but encodes to a different byte:
>>> print(chr(244).encode('utf8'))
b'\xc3\xb4'
>>> print(chr(244).encode('latin1'))
b'\xf4'
>>> print(chr(244).encode('cp500'))
b'\xcb'
But the Latin-1 and EBCDIC 500 codecs can only encode 256 codepoints each; UTF-8 can manage all of the Unicode standard.
If you were expecting the number 244 to be interpreted as a byte value instead, then you should not use chr().encode(); chr() produces a unicode value, not a 'byte', and encoding then produces a different result depending on the exact codec. That's because unicode values are text, not bytes.
Pass your number as a list of integers to the bytes() callable instead:
>>> bytes([244])
b'\xf4'
This only happens to fit the Latin-1 codec result, because the first 256 Unicode codepoints map directly to Latin 1 bytes, by design.
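To make the text-versus-bytes distinction concrete (a small sketch):
>>> bytes([244]).decode('latin1')   # the byte 0xF4 read as Latin-1 text
'ô'
>>> bytes([244]).decode('utf-8')    # raises UnicodeDecodeError: 0xF4 starts a four-byte sequence that never finishes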
Character #244 is U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX which is indeed encoded as 0xc3 0xb4 in UTF-8. If you want to use a single-byte encoding then you need to specify it.
I imagine the number 244 can be encoded into one byte!
Sure, if you design an encoding that can only handle 256 code points, all of them can be encoded into one byte.
But if you design an encoding that can handle all of Unicode's 111000+ code points, obviously you can't pack all of them into one byte.
If your only goal were to make things as compact as possible, you could use most of the 256 initial byte values for common code points, and only reserve a few as start bytes for less common code points.
However, if you only use the lower 128 for single-byte values, there are some big advantages, especially if you design it so that every byte is unambiguously either a 7-bit character, a start byte, or a continuation byte. That makes the algorithm a lot simpler to implement and faster; you can always scan forward or backward to the start of a character; you can search for ASCII text in a string with traditional byte-oriented (strchr) searches; a simple heuristic can detect your encoding very reliably; and you can always detect a truncated string start/end instead of misinterpreting it. So, that's exactly what UTF-8 does.
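For example, the "scan backward to the start of a character" property falls straight out of those bit patterns (a small sketch; char_start is just an illustrative helper):
def char_start(data, i):
    # back up over continuation bytes (10xxxxxx) until we reach an ASCII or lead byte
    while (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

text = 'naïve ☂'.encode('utf-8')   # b'na\xc3\xafve \xe2\x98\x82'
print(char_start(text, 3))         # byte 3 is a continuation byte of 'ï'; prints 2, the index of its lead byte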
Wikipedia explains UTF-8 pretty well. Rob Pike, one of the inventors of UTF-8, explains the design history in detail.