In Python (either 2 or 3), evaluating b'\xe2\x80\x8f'.decode("utf-8")
yields \u200f, and similarly '\u200f'.encode("utf-8") yields b'\xe2\x80\x8f'.
The first looks like a chain of three 2-character hex values that would equal decimal 226, 128, and 143. The second looks like a single hex value that would equal decimal 8,207.
Is there a logical relationship between '\xe2\x80\x8f' and '\u200f' ? Am I interpreting the values incorrectly?
I can see the values are linked somehow in tables like this one: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string-literal
but why are these two values on the same row?
The difference is related to the amount of bits/bytes that each character takes up to represent in utf-8.
For any character equal to or below 127 (hex 0x7F), the UTF-8
representation is one byte. It is just the lowest 7 bits of the full
unicode value. This is also the same as the ASCII value.
For characters equal to or below 2047 (hex 0x07FF), the UTF-8
representation is spread across two bytes. The first byte will have
the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The
second byte will have the top bit set and the second bit clear (i.e.
0x80 to 0xBF).
There is more information about this here.
If you wanted more info on how Python uses these values, check out here.
Yes, the first is "a chain of three 2-character hex values that would equal decimal 226, 128, and 143." It's a byte string. You got a byte string because that's what encode does. You passed it UTF-8 so the bytes are the UTF-8 encoding for the input character string.
"The second looks like a single hex value that would equal decimal 8,207." Sort of; It's the notation for a UTF-16 code unit inside a literal character string. One or two UTF-16 code units encode a Unicode codepoint. In this case, only one is used for the corresponding codepoint.
Sure, you can convert the hex to decimal but that's not very common or useful in either case. A code unit is a specific bit pattern. Bytes are that bit pattern as an integer, serialized to a byte sequence.
The Unicode codepoint range needs 21 bits. UTF-16 encodes a codepoint in one or two 16-bit code units (so that's two bytes in some byte order for each code unit). UTF-8 encodes a codepoint in one, two, three or four 8-bit code units. (An 8-bit integer is one byte so byte order is moot.) Each character encoding has a separate algorithm to distribute the 21 bits into as many bytes are needed. Both are reversible and fully support the Unicode character set. So, you could directly convert one to the other.
The table you reference doesn't show UTF-16. It shows Unicode codepoint hex notation: U+200F. That notation is for humans to identify codepoints. It so happens that when UTF-16 encodes a codepoint in one code unit, it's number is the same as the codepoint's number.
Related
I am trying to figure out how to either convert UTF-16 offsets to UTF-8 offsets, or somehow be able to count the # of UTF-16 code points in a string. (I think in order to do the former, you have to do the latter anyways.)
Sanity check: I am correct that the len() function, when operated on a python string returns the number of code points in it in UTF-8?
I need to do this because the LSP protocol requires the offsets to be in UTF-16, and I am trying to build something with LSP in mind.
I can't seem to find how to do this, the only python LSP server I know of doesn't even handle this conversion itself.
Python has two datatypes which can be used for characters, neither of which natively represents UTF-16 code units.
In Python-3, strings are represented as str objects, which are conceptually vectors of unicode codepoints. So the length of a str is the number of Unicode characters it contains, and len("๐") is 1, just as with any other single character. That's independent of the fact that "๐" requires two UTF-16 code units (or four UTF-8 code units).
Python-3 also has a bytes object, which is a vector of bytes (as its name suggests). You can encode a str into a sequence of bytes using the encode method, specifying some encoding. So if you want to produce the stream of bytes representing the character "๐" in UTF-16LE, you would invoke "๐".encode('utf-16-le').
Specifying le (for little-endian) is important because encode produces a stream of bytes, not UTF-16 code units, and each code unit requires two bytes since it's a 16-bit number. If you don't specify a byte order, as in encode('utf-16'), you'll find a two-byte UFtF-16 Byte Order Mark at the beginning of the encoded stream.
Since the UTF-16 encoding requires exactly two bytes for each UTF-16 code unit, you can get the UTF-16 length of a unicode string by dividing the length of the encoded bytes object by two: s.encode('utf-16-le')//2.
But that's a pretty clunky way to convert between UTF-16 offsets and character indexes. Instead, you can just use the fact that characters representable with a single UTF-16 code unit are precisely the characters with codepoints less than 65536 (216):
def utf16len(c):
"""Returns the length of the single character 'c'
in UTF-16 code units."""
return 1 if ord(c) < 65536 else 2
For counting the bytes, including BOM, len(str.encode("utf-16")) would work. You can use utf-16-le for bytes without BOM.
Example:
>>> len("abcd".encode("utf-16"))
10
>>> len("abcd".encode("utf-16-le"))
8
As for your question: No, len(str) in Python checks the number of decoded characters. If a character takes 4 UTF-8 code points, it still counts as 1.
When I print a program such as this in Python:
x = b'francis'
The output is b'francis'. If bytes is in 0's and 1's why is it not printing it out?
You seem to be fundamentally confused, in a very common way. The data itself is a distinct concept from its representation, i.e. what you see when you attempt to print it out or otherwise display it. There may be multiple ways to represent the same data. This is just like how if I write 23 (in decimal) or 0x17 (hexadecimal) or 0o27 (octal) or 0b10111 (binary) or twenty-three (English), I am talking about the same number.
At some lower level below Python, everything is bytes, and each byte consists of bits; but it is not correct to say that the bytes "are in" 0s and 1s - just like how it is not correct to say that the number twenty-three "is in" decimal digits (or hexadecimal, octal or binary ones, or in English text characters).
The symbols 0 and 1 are just pictures that we draw on a screen to represent the state of those bits - if we choose to represent them individually. Sometimes, we choose larger groupings, and assign different symbols to various combinations of states. For example, we may interpret multiple bits as a single integer value in binary; or (using Unicode) we might further interpret that number as a "code point" (most of these are text characters; some are control characters, or portions of text characters).
A Python bytes object is a wrapper for a "raw" sequence of bytes. When you display it, Python uses a representation where each byte (grouping of 8 bits) corresponds to one or more symbols: bytes whose corresponding integer value is between thirty-two and one hundred twenty-six (inclusive) are (for historical reasons) represented using individual text characters (following the so-called ASCII encoding), while others are represented with a four-character "escape sequence" beginning with \x and followed by the hexadecimal representation of the number.
From python docs:
bytes and bytearray objects are sequences of integers (between 0 and
255), representing the ASCII value of single bytes.
So they are sequence of integers which represents ASCII values.
For conversion you can use:
import sys
int.from_bytes(b'\x11', byteorder=sys.byteorder) # => 17
bin(int.from_bytes(b'\x11', byteorder=sys.byteorder)) # => '0b10001'
The bytes object was intentionally designed to work like this: the repr uses the corresponding ASCII characters for bytes in the printable ASCII range, well-known backslash escapes for a few special ASCII control characters, and hex backslash escapes for everything else (and the str just is the repr).
The basic idea is that bytes can be used as an immutable array of integers from 0-255, but more often it's used as an immutable array of characters encoded in some ASCII-compatible charset.
In particular, one of the most common uses of bytes is for things like the headers in HTTP, SMTP, and other network protocols. These headers are generally entirely in pure ASCII, or at least pure ASCII keys with some values in pure ASCII and others in an ASCII-compatible charsetโand you generally have to parse the ASCII headers to figure out what charset to use to decode the body. Being able to see those headers are ASCII characters is a lot more useful than just seeing them as a sequence of numbers.
Basically, everything on your computer is eventually represented by 0's and 1's.
The purpose of b-notation isn't as you expected it to be.
I would like to refer you to a great answer that might help you understand what the b-notation is for and how to use it properly:
What does the 'b' character do in front of a string literal?
Good luck.
I saved some strings in Microsoft Agenda in Unicode big endian format (UTF-16BE). When I open it with the shell command xxd to see the binary value, write it down, and get the value of the Unicode code point by ord() to get the ordinal value character by character (this is a python built-in function which takes a one-character Unicode string and returns the code point value), and compare them, I find they are equal.
But I think that the Unicode code point value is different to UTF-16BE โ one is a code point; the other is an encoding format. Some of them are equal, but maybe they are different for some characters.
Is the Unicode code point value equal to the UTF-16BE encoding representation for every character?
No, codepoints outside of the Basic Multilingual Plane use two UTF-16 words (so 4 bytes).
For codepoints in the U+0000 to U+D7FF and U+E000 to U+FFFF ranges, the codepoint and UTF-16 encoding map one-to-one.
For codepoints in the range U+10000 to U+10FFFF, two words in the range U+D800 to U+DFFF are used; a lead surrogate from 0xD800 to 0xDBFF and a trail surrogate from 0xDC00 to 0xDFFF.
See the UTF-16 Wikipedia article on the nitty gritty details.
So, most UTF-16 big-endian bytes, when printed, can be mapped directly to Unicode codepoints. For UTF-16 little-endian you just swap the bytes around. For UTF-16 words in starting with a 0xD8 through to 0xDF byte, you'll have to map surrogates to the actual codepoint.
If I run
print(chr(244).encode())
I get the two-byte result b'\xc3\xb4'. Why is that? I imagine the number 244 can be encoded into one byte!
Your default locale appears to use UTF-8 as the output encoding.
Any codepoint outside the range 0-127 is encoded with multiple bytes in the variable-width UTF-8 codec.
You'll have to use a different codec to encode that codepoint to one byte. The Latin-1 encoding can manage it just fine, while the EBCDIC 500 codec (codepage 500) can too, but encodes to a different byte:
>>> print(chr(244).encode('utf8'))
b'\xc3\xb4'
>>> print(chr(244).encode('latin1'))
b'\xf4'
>>> print(chr(244).encode('cp500'))
b'\xcb'
But Latin-1 and EBCDIC 500 codecs can only encode 255 codepoints; UTF-8 can manage all of the Unicode standard.
If you were expecting the number 244 to be interpreted as a byte value instead, then you should not use chr().encode(); chr() produces a unicode value, not a 'byte', and encoding then produces a different result depending on the exact codec. That's because unicode values are text, not bytes.
Pass your number as a list of integers to the bytes() callable instead:
>>> bytes([244])
b'\xf4'
This only happens to fit the Latin-1 codec result, because the first 256 Unicode codepoints map directly to Latin 1 bytes, by design.
Character #244 is U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX which is indeed encoded as 0xc3 0xb4 in UTF-8. If you want to use a single-byte encoding then you need to specify it.
I imagine the number 244 can be encoded into one byte!
Sure, if you design an encoding that can only handle 256 code points, all of them can be encoded into one byte.
But if you design an encoding that can handle all of Unicode's 111000+ code points, obviously you can't pack all of them into one byte.
If your only goal were to make things as compact as possible, you could use most of the 256 initial byte values for common code points, and only reserve a few as start bytes for less common code points.
However, if you only use the lower 128 for single-byte values, there are some big advantages. Especially if you design it so that every byte is unambiguously either a 7-bit character, a start byte, or a continuation byte. That makes the algorithm is a lot simpler to implement and faster, you can always scan forward or backward to the start of a character, you can search for ASCII text in a string with traditional byte-oriented (strchr) searches, a simple heuristic can detect your encoding very reliably, you can always detect truncated string start/end instead of misinterpreting it, etc. So, that's exactly what UTF-8 does.
Wikipedia explains UTF-8 pretty well. Rob Pike, one of the inventors of UTF-8, explains the design history in detail.
I get a string from a function that is represented like u'\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0', but to process it I need it to be bytestring (like '\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0').
How do I convert it without changes?
My best guess so far is to take s.encode('unicode_escape'), which will return '\\xd0\\xbc\\xd0\\xb0\\xd1\\x80\\xd0\\xba\\xd0\\xb0' and process every 5 characters so that '\xd0' becomes one character represented as '\xd0'.
ISO 8859-1 (aka Latin-1) maps the first 256 Unicode codepoints to their byte values.
>>> u'\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0'.encode('latin-1')
'\xd0\xbc\xd0\xb0\xd1\x80\xd0\xba\xd0\xb0'