Background: I've got a byte file that is encoded using Unicode. However, I can't figure out the right method to get Python to decode it to a string. Sometimes it uses 1-byte ASCII text. The majority of the time it uses 2-byte "plain latin" text, but it can possibly contain any Unicode character. So my Python program needs to be able to decode that and handle it. Unfortunately byte_string.decode('unicode') isn't a thing, so I need to specify another encoding scheme. I'm using Python 3.9.
I've read through the Python docs on Unicode and UTF-8. If Python uses Unicode for its strings, and UTF-8 as the default, this should be pretty straightforward, yet I keep getting incorrect decodes.
If I understand how Unicode works, the most significant byte is the character code, and the least significant byte is the lookup value in the decode table. So I would expect 0x00_41 to decode to "A",
0x00_F2 => ò
0x65_03_01 => é (e with combining acute accent).
I wrote a short test file to experiment with these byte combinations, and I'm running into a few situations that I don't understand (despite extensive reading).
Example code:
def main():
    print("Starting MAIN...")
    vrsn_bytes = b'\x76\x72\x73\x6E'
    serato_bytes = b'\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F'
    special_bytes = b'\xB2\xF2'
    combining_bytes = b'\x41\x75\x64\x65\x03\x01'
    print(f"vrsn_bytes: {vrsn_bytes}")
    print(f"serato_bytes: {serato_bytes}")
    print(f"special_bytes: {special_bytes}")
    print(f"combining_bytes: {combining_bytes}")
    encoding_method = 'utf-8'  # also tried latin-1 and cp1252
    vrsn_str = vrsn_bytes.decode(encoding_method)
    serato_str = serato_bytes.decode(encoding_method)
    special_str = special_bytes.decode(encoding_method)
    combining_str = combining_bytes.decode(encoding_method)
    print(f"vrsn_str: {vrsn_str}")
    print(f"serato_str: {serato_str}")
    print(f"special_str: {special_str}")
    print(f"combining_str: {combining_str}")
    return True

if __name__ == '__main__':
    print("Starting Command Line Experiment!")
    if not main():
        print("\n Command Line Test FAILED!!")
    else:
        print("\n Command Line Test PASSED!!")
Issue 1: utf-8 encoding. As the experiment is written, I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 0: invalid start byte
I don't understand why this fails to decode. According to the Unicode decode table, 0x00B2 should be "SUPERSCRIPT TWO". In fact, it seems like anything above 0x7F returns the same UnicodeDecodeError.
I know that some encoding schemes only support 7 bits, which is what seems like is happening, but utf-8 should support not only 8 bits, but multiple bytes.
If I changed encoding_method to encoding_method = 'latin-1' which extends the original ascii 128 characters to 256 characters (up to 0xFF), then I get a better output:
vrsn_str: vrsn
serato_str: Serato
special_str: ²ò
combining_str: Aude
However, this encoding is not handling the 2-byte codes properly. \x00_53 should be S, not a NUL character followed by S, and none of the encoding methods I'll mention in this post handle the combining acute accent after Aude properly.
So far I've tried many different encoding methods, but the ones that are closest are: unicode_escape, latin-1, and cp1252. While I expect utf-8 to be what I'm supposed to use, it does not behave like it's described in the Python doc linked above.
Any help is appreciated. Besides trying more methods, I don't understand why this isn't decoding according to the table in link 3.
UPDATE:
After some more reading, and seeing your responses, I understand why you're so confused. I'm going to explain further so that hopefully this helps someone in the future.
The byte file that I'm decoding is not mine (hence why the encoding does not make sense). What I see now is that the bytes represent the code point, not the byte representation of the unicode character.
For example: I want 0x00_F2 to translate to ò. But the actual UTF-8 byte representation of ò is 0xC3_B2. What I have is the integer representation of the code point. So while I was trying to decode, what I actually need to do is convert 0x00F2 to an integer (242); then I can use chr(242) to convert it to a Unicode character.
I don't know why the file was written this way, and I can't change it. But I see now why the decoding wasn't working. Hopefully this helps someone avoid the frustration I've been through figuring this out.
Thanks!
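The approach described in the update (read each fixed-width field as an integer, then call chr() on it) could be sketched roughly as follows. The function name decode_code_points and the assumption that every field is exactly 2 bytes, big-endian, are illustrative only; the question does not say how field boundaries are marked in the real file.

```python
def decode_code_points(raw: bytes, width: int = 2) -> str:
    """Treat `raw` as a run of fixed-width, big-endian code point numbers.

    Hypothetical helper: the 2-byte width is an assumption, not something
    the file format in the question confirms.
    """
    if len(raw) % width:
        raise ValueError("length is not a multiple of the field width")
    return "".join(
        chr(int.from_bytes(raw[i:i + width], "big"))
        for i in range(0, len(raw), width)
    )


serato_bytes = b'\x00\x53\x00\x65\x00\x72\x00\x61\x00\x74\x00\x6F'
print(decode_code_points(serato_bytes))          # Serato
print(decode_code_points(b'\x00\xB2\x00\xF2'))   # ²ò
```

For data that really is 2 bytes per character, serato_bytes.decode('utf-16-be') gives the same result as long as no value falls in the surrogate range, so the file may simply be UTF-16 big-endian; that is a guess. Mixed-width data such as combining_bytes from the question cannot be handled this way without knowing where each field starts.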
This isn't actually a Python issue; it's a matter of how the character is encoded. To convert a Unicode code point to UTF-8, you cannot simply take the bytes of the code point number.
For example, the code point U+2192 is →. The actual binary representation in utf-8 is: 0xE28692, or 11100010 10000110 10010010
As we can see, this is 3 bytes, not 2 as we'd expect if we only used the position. To get correct behavior, you can either do the encoding by hand, or use a converter such as this one:
https://onlineunicodetools.com/convert-unicode-to-binary
This will let you input a unicode character and get the utf-8 binary representation.
To get correct output for ò, we need to use 0xC3B2.
>>> s = b'\xC3\xB2'
>>> print(s.decode('utf-8'))
ò
The reason you can't use the direct binary representation is the header bits in each byte. In utf-8, a code point is encoded as 1, 2, 3, or 4 bytes. For example, to signify a 1-byte code point, the first bit is encoded as a 0. This means that we can only store 2^7 1-byte code points. So the code point U+0080, which is a control character, must be encoded as a 2-byte sequence: 11000010 10000000.
For this character, the first byte begins with the header 110, while the second byte begins with the header 10. This means that the data for the codepoint is stored in the last 5 bits of the first byte and the last 6 bits of the second byte. If we combine those, we get
00010 000000, which is equivalent to 0x80.
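To make the header arithmetic concrete, here is a small sketch that builds the 2-byte form by hand and compares it with Python's own encoder. The function name encode_two_byte_utf8 is made up for this illustration, and it deliberately covers only the 2-byte range (U+0080 to U+07FF).

```python
def encode_two_byte_utf8(code_point: int) -> bytes:
    """Hand-encode a code point in U+0080..U+07FF as 2-byte UTF-8."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points U+0080..U+07FF use two bytes")
    first = 0b11000000 | (code_point >> 6)          # header 110 + top 5 bits
    second = 0b10000000 | (code_point & 0b111111)   # header 10 + low 6 bits
    return bytes([first, second])


for cp in (0x80, 0xB2, 0xF2):
    manual = encode_two_byte_utf8(cp)
    # The hand-built bytes should match the built-in encoder.
    assert manual == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {manual.hex(' ')}")
```

This prints c2 80, c2 b2 and c3 b2, matching the 0xC3B2 used for ò above.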
I am quite confused about the concept of character encoding.
What is Unicode, GBK, etc? How does a programming language use them?
Do I need to bother knowing about them? Is there a simpler or faster way of programming without having to trouble myself with them?
ASCII is fundamental
Originally 1 character was always stored as 1 byte. A byte (8 bits) can distinguish 256 possible values. But in fact only the first 7 bits were used, so only 128 characters were defined. This set is known as the ASCII character set.
0x00 - 0x1F contain control codes (e.g. CR, LF, STX, ETX, EOT, BEL, ...)
0x20 - 0x40 contain numbers and punctuation
0x41 - 0x7F contain mostly alphabetic characters
0x80 - 0xFF the 8th bit = undefined.
French, German and many other languages needed additional characters. (e.g. à, é, ç, ô, ...) which were not available in the ASCII character set. So they used the 8th bit to define their characters. This is what is known as "extended ASCII".
The problem is that the additional 1 bit does not have enough capacity to cover all languages in the world. So each region has its own ASCII variant. There are many extended ASCII encodings (latin-1 being a very popular one).
Popular question: "Is ASCII a character set or is it an encoding?" ASCII is a character set. However, in programming, charset and encoding are widely used as synonyms. If I want to refer to an encoding that only contains the ASCII characters and nothing more (the 8th bit is always 0): that's US-ASCII.
Unicode goes one step further
Unicode is a great example of a character set - not an encoding. It uses the same characters as the ASCII standard, but it extends the list with additional characters, and gives each character a code point in the format U+XXXX. It has the ambition to contain all characters (and popular icons) used in the entire world.
UTF-8, UTF-16 and UTF-32 are encodings that apply the Unicode character table. But they each have a slightly different way of encoding it. UTF-8 will only use 1 byte when encoding an ASCII character, giving the same output as any other ASCII encoding. But for other characters, it sets the high bit of the first byte to indicate that more bytes follow (a character takes up to 4 bytes in total).
GBK is an encoding which, just like UTF-8, uses multiple bytes. The principle is pretty much the same. The first byte follows the ASCII standard, so only 7 bits are used. But just like with UTF-8, the 8th bit can be used to indicate the presence of a 2nd byte, which it then uses to encode one of 22,000 Chinese characters. The main difference is that this does not follow the Unicode character set; instead it uses a Chinese character set.
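As a rough illustration of the difference between a character set and its encodings, the following sketch encodes the same few characters with several codecs that ship with Python. The sample characters are arbitrary choices for this example.

```python
samples = ["A", "é", "中"]
encodings = ["utf-8", "utf-16-be", "utf-32-be", "gbk"]

for ch in samples:
    print(f"U+{ord(ch):04X} {ch!r}")
    for enc in encodings:
        try:
            encoded = ch.encode(enc)
        except UnicodeEncodeError:
            # Not every encoding can represent every character.
            print(f"  {enc:10} cannot represent this character")
            continue
        print(f"  {enc:10} {len(encoded)} byte(s): {encoded.hex(' ')}")
```

The ASCII letter comes out as the same single byte in UTF-8 and GBK, while the Chinese character takes 3 bytes in UTF-8 and 2 in GBK, with entirely different byte values.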
Decoding data
When you encode your data, you use an encoding, but when you decode data, you will need to know what encoding was used, and use that same encoding to decode it.
Unfortunately, encodings aren't always declared or specified. It would have been ideal if all files contained a prefix to indicate what encoding their data was stored in. But still in many cases applications just have to assume or guess what encoding they should use. (e.g. they use the standard encoding of the operating system).
There is still a lack of awareness about this; many developers don't even know what an encoding is.
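A short sketch of what goes wrong when the reader guesses a different encoding than the writer used. The word "café" is just an arbitrary sample.

```python
original = "café"
stored = original.encode("utf-8")      # the writer used UTF-8

wrong = stored.decode("latin-1")       # the reader guessed Latin-1
right = stored.decode("utf-8")         # the reader used the same encoding

print(wrong)   # cafÃ©
print(right)   # café

# The opposite mismatch can fail outright instead of producing garbage.
latin1_bytes = original.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

Latin-1 never raises on decode because every byte value maps to some character, which is why a wrong guess there silently yields garbled text.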
Mime types
Mime types are sometimes confused with encodings. They are a useful way for the receiver to identify what kind of data is arriving. Here is an example of how the HTTP protocol defines its content type using a mime type declaration.
Content-Type: text/html; charset=utf-8
And that's another great source of confusion. A mime type describes what kind of data a message contains (e.g. text/xml, image/png, ...). And in some cases it will additionally also describe how the data is encoded (i.e. charset=utf-8). 2 points of confusion:
Not all mime types declare an encoding. In some cases it is only optional or sometimes completely pointless.
The syntax charset=utf-8 adds to the semantic confusion because, as explained earlier, UTF-8 is an encoding and not a character set. But some people just use the 2 words interchangeably.
For example, in the case of text/xml it would be pointless to declare an encoding (and a charset parameter would simply be ignored). Instead, XML parsers in general will read the first line of the file, looking for the <?xml encoding=... tag. If it's there, then they will reopen the file using that encoding.
The same problem exists when sending e-mails. An e-mail can contain an HTML message or just plain text. In that case too, mime types are used to define the type of the content.
But in summary, a mime type isn't always sufficient to solve the problem.
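For illustration, here is one way to separate the mime type from the optional charset parameter using Python's standard email package. The helper name split_content_type is made up for this sketch; it only parses the header value and does not decode any data.

```python
from email.message import Message


def split_content_type(header_value: str):
    """Return (mime_type, charset); charset is None when not declared."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


print(split_content_type("text/html; charset=utf-8"))   # ('text/html', 'utf-8')
print(split_content_type("image/png"))                   # ('image/png', None)
```

When the charset comes back as None, the receiver still has to assume or guess an encoding, which is the situation described above.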
Data types in programming languages
In case of Java (and many other programming languages) in addition to the dangers of encodings, there's also the complexity of casting bytes and integers to characters because their content is stored in different ranges.
a byte is stored as a signed byte (range: -128 to 127).
the char type in Java is stored in 2 unsigned bytes (range: 0 - 65535)
a stream returns an integer in range -1 to 255.
If you know that your data only contains ASCII values, then with the proper skill you can parse your data from bytes to characters or wrap them immediately in Strings.
// the -1 indicates that there is no data
int input = stream.read();
if (input == -1) throw new EOFException();
// bytes must be made positive first.
byte myByte = (byte) input;
int unsignedInteger = myByte & 0xFF;
char ascii = (char)(unsignedInteger);
Shortcuts
The shortcut in Java is to use readers and writers and to specify the encoding when you instantiate them.
// wrap your stream in a reader.
// specify the encoding
// The reader will decode the data for you
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
As explained earlier for XML files it doesn't matter that much, because any decent DOM or JAXB marshaller will check for an encoding attribute.
(Note that I'm using some of these terms loosely/colloquially for a simpler explanation that still hits the key points.)
A byte can only have 256 distinct values, being 8 bits.
Since there are character sets with more than 256 characters in the character set one cannot in general simply say that each character is a byte.
Therefore, there must be mappings that describe how to turn each character in a character set into a sequence of bytes. Some characters might be mapped to a single byte but others will have to be mapped to multiple bytes.
Those mappings are encodings, because they are telling you how to encode characters into sequences of bytes.
As for Unicode, at a very high level, Unicode is an attempt to assign a single, unique number to every character. Obviously that number has to be something wider than a byte since there are more than 256 characters :) Java's char type holds a 16-bit value (this is why Java characters are 16 bits wide and have integer values from 0 to 65535); in current Java versions that value is a UTF-16 code unit, so characters beyond U+FFFF take two chars. When you get the byte representation of a Java character, you have to tell the JVM the encoding you want to use so it will know how to choose the byte sequence for the character.
Character encoding is what you use to solve the problem of writing software for somebody who uses a different language than you do.
You don't know what the characters are or how they are ordered. Therefore, you don't know what the strings in this new language will look like in binary and frankly, you don't care.
What you do have is a way of translating strings from the language you speak to the language they speak (say a translator). You now need a system that is capable of representing both languages in binary without conflicts. The encoding is that system.
It is what allows you to write software that works regardless of the way languages are represented in binary.
Most computer programs must communicate with a person using some text in a natural language (a language used by humans). But computers have no fundamental means for representing text: the fundamental computer representation is a sequence of bits organized into bytes and words, with hardware support for interpreting sequences of bits as fixed width base-2 (binary) integers and floating-point real numbers. Computer programs must therefore have a scheme for representing text as sequences of bits. This is fundamentally what character encoding is. There is no inherently obvious or correct scheme for character encoding, and so there exist many possible character encodings.
However, practical character encodings have some shared characteristics.
Encoded texts are divided into a sequence of characters (graphemes).
Each of the known possible characters has an encoding. The encoding of a text consists of the sequence of the encoding of the characters of the text.
Each possible (allowed) character is assigned a unique unsigned (non negative) integer (this is sometimes called a code point). Texts are therefore encoded as a sequence of unsigned integers. Different character encodings differ in the characters they allow, and how they assign these unique integers. Most character encodings do not allow all the characters used by the many human writing systems (scripts) that do and have existed. Thus character encodings differ in which texts they can represent at all. Even character encodings that can represent the same text can represent it differently, because of their different assignment of code points.
The unsigned integer encoding a character is encoded as a sequence of bits. Character encodings differ in the number of bits they use for this encoding. When those bits are grouped into bytes (as is the case for popular encodings), character encodings can differ in endianness. Character encodings can differ in whether they are fixed width (the same number of bits for each encoded character) or variable width (using more bits for some characters).
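The last point (endianness, and fixed versus variable width) can be seen with a few of the Unicode encodings built into Python. The two sample characters are arbitrary.

```python
# Variable width vs fixed width: bytes used per character.
for ch in "A€":
    widths = {enc: len(ch.encode(enc))
              for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(f"U+{ord(ch):04X}", widths)

# Endianness: the same 16-bit number, bytes in opposite order.
little = "A".encode("utf-16-le")
big = "A".encode("utf-16-be")
print(little.hex(" "))   # 41 00
print(big.hex(" "))      # 00 41
```

UTF-32 uses 4 bytes for both characters, while UTF-8 uses 1 byte for "A" and 3 for "€".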
Therefore, if a computer program receives a sequence of bytes that are meant to represent some text, the computer program must know the character encoding used for that text, if it is to do any kind of manipulation of that text (other than regarding it as an opaque value and forwarding it unchanged). The only possibilities are that the text is accompanied by additional data that indicates the encoding used or the program requires (assumes) that the text has a particular encoding.
Similarly, if a computer program must send (output) text to another program or a display device, it must either tell the destination the character encoding used or the program must use the encoding that the destination expects.
In practice, almost all problems with character encodings are caused when a destination expects text sent using one character encoding, and the text is actually sent with a different character encoding. That in turn is typically caused by the computer programmer not bearing in mind that there exist many possible character encodings, and that their program cannot treat encoded text as opaque values, but must convert from an external representation on input and convert to an external representation on output.
I am trying to figure out how to either convert UTF-16 offsets to UTF-8 offsets, or somehow be able to count the # of UTF-16 code units in a string. (I think in order to do the former, you have to do the latter anyways.)
Sanity check: Am I correct that the len() function, when applied to a Python string, returns the number of code points in it in UTF-8?
I need to do this because the LSP protocol requires the offsets to be in UTF-16, and I am trying to build something with LSP in mind.
I can't seem to find how to do this, the only python LSP server I know of doesn't even handle this conversion itself.
Python has two datatypes which can be used for characters, neither of which natively represents UTF-16 code units.
In Python-3, strings are represented as str objects, which are conceptually vectors of unicode codepoints. So the length of a str is the number of Unicode characters it contains, and len("𐐀") is 1, just as with any other single character. That's independent of the fact that "𐐀" requires two UTF-16 code units (or four UTF-8 code units).
Python-3 also has a bytes object, which is a vector of bytes (as its name suggests). You can encode a str into a sequence of bytes using the encode method, specifying some encoding. So if you want to produce the stream of bytes representing the character "𐐀" in UTF-16LE, you would invoke "𐐀".encode('utf-16-le').
Specifying le (for little-endian) is important because encode produces a stream of bytes, not UTF-16 code units, and each code unit requires two bytes since it's a 16-bit number. If you don't specify a byte order, as in encode('utf-16'), you'll find a two-byte UTF-16 Byte Order Mark at the beginning of the encoded stream.
Since the UTF-16 encoding requires exactly two bytes for each UTF-16 code unit, you can get the UTF-16 length of a unicode string by dividing the length of the encoded bytes object by two: len(s.encode('utf-16-le')) // 2.
But that's a pretty clunky way to convert between UTF-16 offsets and character indexes. Instead, you can just use the fact that characters representable with a single UTF-16 code unit are precisely the characters with codepoints less than 65536 (2^16):
def utf16len(c):
    """Returns the length of the single character 'c'
    in UTF-16 code units."""
    return 1 if ord(c) < 65536 else 2
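Building on utf16len, a conversion between UTF-16 offsets and str indexes might look like the sketch below. The two conversion function names are invented for this example, and the choice to round an offset that lands inside a surrogate pair up to the next character is an assumption; a real LSP implementation may want to treat that case differently.

```python
def utf16len(c):
    """Length of the single character 'c' in UTF-16 code units."""
    return 1 if ord(c) < 65536 else 2


def utf16_offset_to_index(s: str, utf16_offset: int) -> int:
    """Map an offset in UTF-16 code units to an index into the str.

    An offset inside a surrogate pair maps to the following character
    (an assumption made for this sketch).
    """
    units = 0
    for index, ch in enumerate(s):
        if units >= utf16_offset:
            return index
        units += utf16len(ch)
    return len(s)


def index_to_utf16_offset(s: str, index: int) -> int:
    """Map an index into the str to an offset in UTF-16 code units."""
    return sum(utf16len(ch) for ch in s[:index])


text = "a𐐀b"
print(len(text))                          # 3
print(index_to_utf16_offset(text, 2))     # 3
print(utf16_offset_to_index(text, 3))     # 2
```

Both functions walk the string from the start, so each call is linear in the length of the text; for large documents it may be worth caching per-line results.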
For counting the bytes, including BOM, len(str.encode("utf-16")) would work. You can use utf-16-le for bytes without BOM.
Example:
>>> len("abcd".encode("utf-16"))
10
>>> len("abcd".encode("utf-16-le"))
8
As for your question: No, len(str) in Python counts the number of decoded characters (code points). If a character takes 4 bytes in UTF-8, it still counts as 1.
What's a Python bytestring?
All I can find are topics on how to encode to bytestring or decode to ASCII or UTF-8. I'm trying to understand how it works under the hood. In a normal ASCII string, it's an array or list of characters, and each character represents an ASCII value from 0-255, so that's how you know what character is represented by the number. In Unicode, it's the 8- or 16-byte representation for the character that tells you what character it is.
So what is a bytestring? How does Python know which characters to represent as what? How does it work under the hood? Since you can print or even return these strings and it shows you the string representation, I don't quite get it...
Ok, so my point is definitely getting missed here. I've been told that it's an immutable sequence of bytes without any particular interpretation.
A sequence of bytes.. Okay, let's say one byte:
'a'.encode() returns b'a'.
Simple enough. Why can I read the a?
Say I get the ASCII value for a, by doing this:
printf "%d" "'a"
It returns 97. Okay, good, the integer value for the ASCII character a. If we interpret 97 as ASCII, say in a C char, then we get the letter a. Fair enough. If we convert the byte representation to bits, we get this:
01100001
2^0 + 2^5 + 2^6 = 97. Cool.
So why is 'a'.encode() returning b'a' instead of 01100001??
If it's without a particular interpretation, shouldn't it be returning something like b'01100001'?
It seems like it's interpreting it like ASCII.
Someone mentioned that it's calling __repr__ on the bytestring, so it's displayed in human-readable form. However, even if I do something like:
with open('testbytestring.txt', 'wb') as f:
    f.write(b'helloworld')
It will still insert helloworld as a regular string into the file, not as a sequence of bytes... So is a bytestring in ASCII?
It is a common misconception that text is ASCII or UTF-8 or Windows-1252, and therefore bytes are text.
Text is only text, in the way that images are only images. The matter of storing text or images to disk is a matter of encoding that data into a sequence of bytes. There are many ways to encode images into bytes: JPEG, PNG, SVG, and likewise many ways to encode text, ASCII, UTF-8 or Windows-1252.
Once encoding has happened, bytes are just bytes. Bytes are not images anymore; they have forgotten the colors they mean; although an image format decoder can recover that information. Bytes have similarly forgotten the letters they used to be. In fact, bytes don't remember whether they were images or text at all. Only out of band knowledge (filename, media headers, etcetera) can guess what those bytes should mean, and even that can be wrong (in case of data corruption).
So, in Python 3, we have two types for things that might otherwise look similar. For text, we have str, which knows it's text; it knows which letters it's supposed to mean. It doesn't know which bytes that might be, since letters are not bytes. We also have bytes (the "bytestring"), which doesn't know whether it's text or images or any other kind of data.
The two types are superficially similar, since they are both sequences of things, but the things that they are sequences of is quite different.
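A small sketch of that difference (the sample word is just an illustration): indexing a str gives a one-character str, while indexing a bytes object gives an integer.

```python
text = "héllo"                 # str: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes: a sequence of integers in range(256)

print(len(text), len(data))    # 5 6, because 'é' takes two bytes in UTF-8
print(text[1], data[1])        # é 195
print(list(data))              # the raw byte values
```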
Implementationally, str is stored in memory as UCS-? where the ? is implementation defined; it may be UCS-4, UCS-2 or UCS-1 (Latin-1). In older builds this depended on compile-time options; since CPython 3.3 it is chosen per string, depending on which code points are present in the represented string.
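You can get a rough feel for this with sys.getsizeof. This is only a sketch and assumes CPython 3.3 or later; other implementations may behave differently, and bytes_per_char is a made-up helper name.

```python
import sys

def bytes_per_char(ch: str) -> int:
    # Hypothetical helper: how much the object grows per extra character.
    n = 1000
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

# An ASCII letter, a BMP character, and a character outside the BMP.
widths = [bytes_per_char(c) for c in ("a", "\u1234", "\U0001F600")]
print(widths)  # on CPython 3.3+ this is expected to be [1, 2, 4]
```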
"But why"?
Some things that look like text are actually defined in other terms. A really good example of this is the many Internet protocols of the world. For instance, HTTP is a "text" protocol that is in fact defined using the ABNF syntax common in RFCs. These protocols are expressed in terms of octets, not characters, although an informal encoding may also be suggested:
2.3. Terminal Values
Rules resolve into a string of terminal values, sometimes called
characters. In ABNF, a character is merely a non-negative integer.
In certain contexts, a specific mapping (encoding) of values into a
character set (such as ASCII) will be specified.
This distinction is important, because it's not possible to send text over the internet; the only thing you can do is send bytes. Saying "text but in 'foo' encoding" makes the format that much more complex, since clients and servers now need to somehow figure out the encoding business on their own, hopefully in the same way, since they must ultimately pass data around as bytes anyway. This is doubly useless since these protocols are seldom about text handling anyway, and the text form is only a convenience for implementers. Neither the server owners nor end users are ever interested in reading the words Transfer-Encoding: chunked, so long as both the server and the browser understand it correctly.
By comparison, when working with text, you don't really care how it's encoded. You can express the "Heävy Mëtal Ümlaüts" any way you like, except "Heδvy Mλtal άmlaόts".
The distinct types thus give you a way to say "this value 'means' text" or "bytes".
Python does not know how to represent a bytestring. That's the point.
When you output a character with value 97 into pretty much any output window, you'll get the character 'a' but that's not part of the implementation; it's just a thing that happens to be locally true. If you want an encoding, you don't use bytestring. If you use bytestring, you don't have an encoding.
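To make that concrete, here is a short sketch showing that b'a' is only how Python displays a byte whose value is 97; the value itself carries no encoding.

```python
data = "a".encode()              # one byte with value 97

print(data == bytes([97]))       # True: same object contents, built from a number
print(data == b"\x61")           # True: same byte written as a hex escape
print(format(data[0], "08b"))    # 01100001
print(repr(bytes([97, 255])))    # b'a\xff': printable ASCII shown as-is, the rest escaped
```

The repr uses ASCII letters for byte values that happen to be printable ASCII purely as a display convenience.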
Your piece about .txt files shows you have misunderstood what is happening. You see, plain text files too don't have an encoding. They're just a series of bytes. These bytes get translated into letters by the text editor but there is no guarantee at all that someone else opening your file will see the same thing as you if you stray outside the common set of ASCII characters.
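A quick sketch of that last point, assuming a temporary file and two example encodings picked for illustration: the same bytes on disk read back as different text depending on what the reader assumes.

```python
import os
import tempfile

# Write raw bytes; the file itself records nothing about an encoding.
path = os.path.join(tempfile.mkdtemp(), "testbytestring.txt")
with open(path, "wb") as f:
    f.write("naïve".encode("utf-8"))

# Two readers making different assumptions see different text.
with open(path, encoding="utf-8") as f:
    as_utf8 = f.read()
with open(path, encoding="cp1252") as f:
    as_cp1252 = f.read()

print(as_utf8)      # naïve
print(as_cp1252)    # naÃ¯ve
```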
As the name implies, a Python 3 bytestring (or simply a str in Python 2.7) is a string of bytes. And, as others have pointed out, it is immutable.
It is distinct from a Python 3 str (or, more descriptively, a unicode in Python 2.7), which is a string of abstract Unicode characters (a.k.a. UTF-32, though Python 3 adds fancy compression under the hood to reduce the actual memory footprint similar to UTF-8, perhaps even in a more general way).
There are essentially three ways of "interpreting" these bytes. You can look at the numeric value of an element, like this:
>>> ord(b'Hello'[0]) # Python 2.7 str
72
>>> b'Hello'[0] # Python 3 bytestring
72
Or you can tell Python to emit one or more elements to the terminal (or a file, device, socket, etc.) as 8-bit characters, like this:
>>> print b'Hello'[0] # Python 2.7 str
H
>>> import sys
>>> sys.stdout.buffer.write(b'Hello'[0:1]) and None; print() # Python 3 bytestring
H
As Jack hinted at, in this latter case it is your terminal interpreting the character, not Python.
Finally, as you have seen in your own research, you can also get Python to interpret a bytestring. For example, you can construct an abstract unicode object like this in Python 2.7:
>>> u1234 = unicode(b'\xe1\x88\xb4', 'utf-8')
>>> print u1234.encode('utf-8') # if terminal supports UTF-8
ሴ
>>> u1234
u'\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<type 'unicode'>
>>> len(u1234)
1
>>>
Or like this in Python 3:
>>> u1234 = str(b'\xe1\x88\xb4', 'utf-8')
>>> print (u1234) # if terminal supports UTF-8 AND python auto-infers
ሴ
>>> u1234.encode('unicode-escape')
b'\\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<class 'str'>
>>> len(u1234)
1
(And I am sure that the amount of syntax churn between Python 2.7 and Python 3 around bytestrings, strings, and Unicode had something to do with the continued popularity of Python 2.7. I suppose that when Python 3 was invented they didn't yet realize that everything would become UTF-8 and therefore all the fuss about abstraction was unnecessary.)
But the Unicode abstraction does not happen automatically if you don't want it to. The point of a bytestring is that you can directly get at the bytes. Even if your string happens to be a UTF-8 sequence, you can still access bytes in the sequence:
>>> len(b'\xe1\x88\xb4')
3
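>>> b'\xe1\x88\xb4'[0] # Python 2.7 str; Python 3 bytes gives the integer 225 here
'\xe1'
And this works in both Python 2.7 and Python 3, with the difference being that in Python 2.7 you have str, while in Python 3 you have bytes (and indexing returns an int rather than a one-character string).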
You can also do other wonderful things with bytestrings, like knowing if they will fit in a reserved space within a file, sending them directly over a socket, calculating the HTTP content-length field correctly, and avoiding Python Bug 8260. In short, use bytestrings when your data is processed and stored in bytes.
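For instance, a minimal sketch of the content-length case (the body text is borrowed from the example above and the header line is built by hand purely for illustration): the length that matters on the wire is the byte count, not the character count.

```python
body_text = "Heävy Mëtal Ümlaüts"
body = body_text.encode("utf-8")

print(len(body_text))   # 19 characters
print(len(body))        # 23 bytes: four of the characters need two bytes each

# The header must describe what is actually sent, i.e. the bytes.
header = f"Content-Length: {len(body)}"
print(header)
```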
Bytes objects are immutable sequences of single bytes. The documentation has a very good explanation of what they are and how to use them.