When I print a program such as this in Python:
x = b'francis'
The output is b'francis'. If bytes is in 0's and 1's why is it not printing it out?
You seem to be fundamentally confused, in a very common way. The data itself is a distinct concept from its representation, i.e. what you see when you attempt to print it out or otherwise display it. There may be multiple ways to represent the same data. This is just like how if I write 23 (in decimal) or 0x17 (hexadecimal) or 0o27 (octal) or 0b10111 (binary) or twenty-three (English), I am talking about the same number.
At some lower level below Python, everything is bytes, and each byte consists of bits; but it is not correct to say that the bytes "are in" 0s and 1s - just like how it is not correct to say that the number twenty-three "is in" decimal digits (or hexadecimal, octal or binary ones, or in English text characters).
The symbols 0 and 1 are just pictures that we draw on a screen to represent the state of those bits - if we choose to represent them individually. Sometimes, we choose larger groupings, and assign different symbols to various combinations of states. For example, we may interpret multiple bits as a single integer value in binary; or (using Unicode) we might further interpret that number as a "code point" (most of these are text characters; some are control characters, or portions of text characters).
A Python bytes object is a wrapper for a "raw" sequence of bytes. When you display it, Python uses a representation where each byte (grouping of 8 bits) corresponds to one or more symbols: bytes whose corresponding integer value is between thirty-two and one hundred twenty-six (inclusive) are (for historical reasons) represented using individual text characters (following the so-called ASCII encoding), while others are represented with a four-character "escape sequence" beginning with \x and followed by the hexadecimal representation of the number.
From python docs:
bytes and bytearray objects are sequences of integers (between 0 and
255), representing the ASCII value of single bytes.
So they are sequence of integers which represents ASCII values.
For conversion you can use:
import sys
int.from_bytes(b'\x11', byteorder=sys.byteorder) # => 17
bin(int.from_bytes(b'\x11', byteorder=sys.byteorder)) # => '0b10001'
The bytes object was intentionally designed to work like this: the repr uses the corresponding ASCII characters for bytes in the printable ASCII range, well-known backslash escapes for a few special ASCII control characters, and hex backslash escapes for everything else (and the str just is the repr).
The basic idea is that bytes can be used as an immutable array of integers from 0-255, but more often it's used as an immutable array of characters encoded in some ASCII-compatible charset.
In particular, one of the most common uses of bytes is for things like the headers in HTTP, SMTP, and other network protocols. These headers are generally entirely in pure ASCII, or at least pure ASCII keys with some values in pure ASCII and others in an ASCII-compatible charset—and you generally have to parse the ASCII headers to figure out what charset to use to decode the body. Being able to see those headers are ASCII characters is a lot more useful than just seeing them as a sequence of numbers.
Basically, everything on your computer is eventually represented by 0's and 1's.
The purpose of b-notation isn't as you expected it to be.
I would like to refer you to a great answer that might help you understand what the b-notation is for and how to use it properly:
What does the 'b' character do in front of a string literal?
Good luck.
Related
I am trying to figure out how to either convert UTF-16 offsets to UTF-8 offsets, or somehow be able to count the # of UTF-16 code points in a string. (I think in order to do the former, you have to do the latter anyways.)
Sanity check: I am correct that the len() function, when operated on a python string returns the number of code points in it in UTF-8?
I need to do this because the LSP protocol requires the offsets to be in UTF-16, and I am trying to build something with LSP in mind.
I can't seem to find how to do this, the only python LSP server I know of doesn't even handle this conversion itself.
Python has two datatypes which can be used for characters, neither of which natively represents UTF-16 code units.
In Python-3, strings are represented as str objects, which are conceptually vectors of unicode codepoints. So the length of a str is the number of Unicode characters it contains, and len("𐐀") is 1, just as with any other single character. That's independent of the fact that "𐐀" requires two UTF-16 code units (or four UTF-8 code units).
Python-3 also has a bytes object, which is a vector of bytes (as its name suggests). You can encode a str into a sequence of bytes using the encode method, specifying some encoding. So if you want to produce the stream of bytes representing the character "𐐀" in UTF-16LE, you would invoke "𐐀".encode('utf-16-le').
Specifying le (for little-endian) is important because encode produces a stream of bytes, not UTF-16 code units, and each code unit requires two bytes since it's a 16-bit number. If you don't specify a byte order, as in encode('utf-16'), you'll find a two-byte UFtF-16 Byte Order Mark at the beginning of the encoded stream.
Since the UTF-16 encoding requires exactly two bytes for each UTF-16 code unit, you can get the UTF-16 length of a unicode string by dividing the length of the encoded bytes object by two: s.encode('utf-16-le')//2.
But that's a pretty clunky way to convert between UTF-16 offsets and character indexes. Instead, you can just use the fact that characters representable with a single UTF-16 code unit are precisely the characters with codepoints less than 65536 (216):
def utf16len(c):
"""Returns the length of the single character 'c'
in UTF-16 code units."""
return 1 if ord(c) < 65536 else 2
For counting the bytes, including BOM, len(str.encode("utf-16")) would work. You can use utf-16-le for bytes without BOM.
Example:
>>> len("abcd".encode("utf-16"))
10
>>> len("abcd".encode("utf-16-le"))
8
As for your question: No, len(str) in Python checks the number of decoded characters. If a character takes 4 UTF-8 code points, it still counts as 1.
def craft_integration(xintegration_time):
integration_time = xintegration_time
integration_time_str = str(integration_time)
integration_time_str = integration_time_str.encode('utf-8')
integration_time_hex = integration_time_str.hex()
return integration_time_hex
def send_set_integration(xtime):
int_time_hex = decoder_crafter.craft_integration(xtime)
set_hex = "c1c000000000000010001100000000000000000000000004"+int_time_hex+"1400000000000000000000000000000000000000c5c4c3c2"
set_hex = str(set_hex)
print(set_hex)
set_hex = unhexlify(set_hex)
For example, input is '1000'.
That becomes 31303030 with craft_integration().
It is then inserted into the default hex string.
Output is:
c1c000000000000010001100000000000000000000000004313030301400000000000000000000000000000000000000c5c4c3c2
When unhexlify() is used, output is:
b'\xc1\xc0\x00\x00\x00\x00\x00\x00\x10\x00\x11\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x041000\x14\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xc5\xc4\xc3\xc2'
\x041000 is an conjunction of \x04 and 1000 which was the original input value, not the converted value.
Why would this happen?
What you have in fact is simply your desired value being rendered into a form by the default implementation of bytes.__repr__ that you were not expecting to the point that it was unhelpful to what you want.
To start from a more basic level: in Python, any element (well, any "byte", i.e. a group of 8 bits) inside a bytes type are typically being stored as raw digital representation somewhere in a machine as binary. In order to "print" them out onto a console for human consumption it must be turned into a form that may be interpreted by the console such that the correct glyph may be used to represent the underlying value. For many values, such as 0 (or 00000000 in binary), Python would use \x00 to represent that. The \ is the escape character to start an escape sequence, the x that follows signifies that the escape sequence is to be followed by 2 hexadecimal characters, and combining those two characters with the whole sequence would form the representation of that single byte using four characters. Likewise for 255, in binary that would be 11111111, and this same value as part of a bytes type will be encoded as \xff.
Now there are exceptions - if a given value falls inside the ASCII range, and that it in the range of printable characters, the representation will instead be the corresponding ASCII character. So in the case of the hexadecimal 30 (decimal 48), rendering of that as part of a bytes type will show 0 instead of \x30, as 0 is the corresponding printable character.
So for your case, a bytes representation that was printed out in the console in the form of b'\x041000', is not in fact a big \x value, as the \x escape sequence is only applied to exactly two subsequent characters - all following characters (i.e. 1000) are in fact being represented using the printable characters that would otherwise be represented as \x31\x30\x30\x30.
There is another method available to those who don't mind working with the decimal representation of bytes - simply cast the bytes into a bytearray then into a list. We will take two nul bytes (b'\x00\x00') as an example:
>>> list(bytearray(b'\x00\x00'))
[0, 0]
Clearly those two nul bytes will correspond to two zero values. Now try using the confusing b'\x04\x31\x30\x30\x30' which got rendered into b'\x041000':
>>> list(bytearray(b'\x041000'))
[4, 49, 48, 48, 48]
We can note that it was in fact 5 bytes rendered with the corresponding decimal numbers in a list of 5 elements.
It is often easy to get confused with what the actual value is, vs. what is being shown and visualized on the computer console. Unfortunately the tools we use sometimes amplify that confusion, but as programmers we should understand this and seek ways to minimize this for users of our work, as this example shows that not everyone may have the intuition that certain representations of bytes may instead be represented as printable ASCII.
I am quite confused about the concept of character encoding.
What is Unicode, GBK, etc? How does a programming language use them?
Do I need to bother knowing about them? Is there a simpler or faster way of programming without having to trouble myself with them?
ASCII is fundamental
Originally 1 character was always stored as 1 byte. A byte (8 bits) has the potential to distinct 256 possible values. But in fact only the first 7 bits were used. So only 128 characters were defined. This set is known as the ASCII character set.
0x00 - 0x1F contain steering codes (e.g. CR, LF, STX, ETX, EOT, BEL, ...)
0x20 - 0x40 contain numbers and punctuation
0x41 - 0x7F contain mostly alphabetic characters
0x80 - 0xFF the 8th bit = undefined.
French, German and many other languages needed additional characters. (e.g. à, é, ç, ô, ...) which were not available in the ASCII character set. So they used the 8th bit to define their characters. This is what is known as "extended ASCII".
The problem is that the additional 1 bit has not enough capacity to cover all languages in the world. So each region has its own ASCII variant. There are many extended ASCII encodings (latin-1 being a very popular one).
Popular question: "Is ASCII a character set or is it an encoding" ? ASCII is a character set. However, in programming charset and encoding are wildly used as synonyms. If I want to refer to an encoding that only contains the ASCII characters and nothing more (the 8th bit is always 0): that's US-ASCII.
Unicode goes one step further
Unicode is a great example of a character set - not an encoding. It uses the same characters like the ASCII standard, but it extends the list with additional characters, which gives each character a codepoint in format u+xxxx. It has the ambition to contain all characters (and popular icons) used in the entire world.
UTF-8, UTF-16 and UTF-32 are encodings that apply the Unicode character table. But they each have a slightly different way on how to encode them. UTF-8 will only use 1 byte when encoding an ASCII character, giving the same output as any other ASCII encoding. But for other characters, it will use the first bit to indicate that a 2nd byte will follow.
GBK is an encoding, which just like UTF-8 uses multiple bytes. The principle is pretty much the same. The first byte follows the ASCII standard, so only 7 bits are used. But just like with UTF-8, The 8th bit can be used to indicate the presence of a 2nd byte, which it then uses to encode one of 22,000 Chinese characters. The main difference, is that this does not follow the Unicode character set, by contrast it uses some Chinese character set.
Decoding data
When you encode your data, you use an encoding, but when you decode data, you will need to know what encoding was used, and use that same encoding to decode it.
Unfortunately, encodings aren't always declared or specified. It would have been ideal if all files contained a prefix to indicate what encoding their data was stored in. But still in many cases applications just have to assume or guess what encoding they should use. (e.g. they use the standard encoding of the operating system).
There still is a lack of awareness about this, as still many developers don't even know what an encoding is.
Mime types
Mime types are sometimes confused with encodings. They are a useful way for the receiver to identify what kind of data is arriving. Here is an example, of how the HTTP protocol defines it's content type using a mime type declaration.
Content-Type: text/html; charset=utf-8
And that's another great source of confusion. A mime type describes what kind of data a message contains (e.g. text/xml, image/png, ...). And in some cases it will additionally also describe how the data is encoded (i.e. charset=utf-8). 2 points of confusion:
Not all mime types declare an encoding. In some cases it is only optional or sometimes completely pointless.
The syntax charset=utf-8 adds up to the semantic confusion, because as explained earlier, UTF-8 is an encoding and not a character set. But as explained earlier, some people just use the 2 words interchangeably.
For example, in the case of text/xml it would be pointless to declare an encoding (and a charset parameter would simply be ignored). Instead, XML parsers in general will read the first line of the file, looking for the <?xml encoding=... tag. If it's there, then they will reopen the file using that encoding.
The same problem exists when sending e-mails. An e-mail can contain a html message or just plain text. Also in that case mime types are used to define the type of the content.
But in summary, a mime type isn't always sufficient to solve the problem.
Data types in programming languages
In case of Java (and many other programming languages) in addition to the dangers of encodings, there's also the complexity of casting bytes and integers to characters because their content is stored in different ranges.
a byte is stored as a signed byte (range: -128 to 127).
the char type in java is stored in 2 unsigned bytes (range: 0 - 65535)
a stream returns an integer in range -1 to 255.
If you know that your data only contains ASCII values. Then with the proper skill you can parse your data from bytes to characters or wrap them immediately in Strings.
// the -1 indicates that there is no data
int input = stream.read();
if (input == -1) throw new EOFException();
// bytes must be made positive first.
byte myByte = (byte) input;
int unsignedInteger = myByte & 0xFF;
char ascii = (char)(unsignedInteger);
Shortcuts
The shortcut in java is to use readers and writers and to specify the encoding when you instantiate them.
// wrap your stream in a reader.
// specify the encoding
// The reader will decode the data for you
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
As explained earlier for XML files it doesn't matter that much, because any decent DOM or JAXB marshaller will check for an encoding attribute.
(Note that I'm using some of these terms loosely/colloquially for a simpler explanation that still hits the key points.)
A byte can only have 256 distinct values, being 8 bits.
Since there are character sets with more than 256 characters in the character set one cannot in general simply say that each character is a byte.
Therefore, there must be mappings that describe how to turn each character in a character set into a sequence of bytes. Some characters might be mapped to a single byte but others will have to be mapped to multiple bytes.
Those mappings are encodings, because they are telling you how to encode characters into sequences of bytes.
As for Unicode, at a very high level, Unicode is an attempt to assign a single, unique number to every character. Obviously that number has to be something wider than a byte since there are more than 256 characters :) Java uses a version of Unicode where every character is assigned a 16-bit value (and this is why Java characters are 16 bits wide and have integer values from 0 to 65535). When you get the byte representation of a Java character, you have to tell the JVM the encoding you want to use so it will know how to choose the byte sequence for the character.
Character encoding is what you use to solve the problem of writing software for somebody who uses a different language than you do.
You don't know how what the characters are and how they are ordered. Therefore, you don't know what the strings in this new language will look like in binary and frankly, you don't care.
What you do have is a way of translating strings from the language you speak to the language they speak (say a translator). You now need a system that is capable of representing both languages in binary without conflicts. The encoding is that system.
It is what allows you to write software that works regardless of the way languages are represented in binary.
Most computer programs must communicate with a person using some text in a natural language (a language used by humans). But computers have no fundamental means for representing text: the fundamental computer representation is a sequence of bits organized into bytes and words, with hardware support for interpreting sequences of bits as fixed width base-2 (binary) integers and floating-point real numbers. Computer programs must therefore have a scheme for representing text as sequences of bits. This is fundamentally what character encoding is. There is no inherently obvious or correct scheme for character encoding, and so there exist many possible character encodings.
However, practical character encodings have some shared characteristics.
Encoded texts are divided into a sequence of characters (graphemes).
Each of the known possible characters has an encoding. The encoding of a text consists of the sequence of the encoding of the characters of the text.
Each possible (allowed) character is assigned a unique unsigned (non negative) integer (this is sometimes called a code point). Texts are therefore encoded as a sequence of unsigned integers. Different character encodings differ in the characters they allow, and how they assign these unique integers. Most character encodings do not allow all the characters used by the many human writing systems (scripts) that do and have existed. Thus character encodings differ in which texts they can represent at all. Even character encodings that can represent the same text can represent it differently, because of their different assignment of code points.
The unsigned integer encoding a character is encoded as a sequence of bits. Character encodings differ in the number of bits they use for this encoding. When those bits are grouped into bytes (as is the case for popular encodings), character encodings can differ in endianess. Character encodings can differ in whether they are fixed width (the same number of bits for each encoded character) or variable width (using more bits for some characters).
Therefore, if a computer program receives a sequence of bytes that are meant to represent some text, the computer program must know the character encoding used for that text, if it is to do any kind of manipulation of that text (other than regarding it as an opaque value and forwarding it unchanged). The only possibilities are that the text is accompanied by additional data that indicates the encoding used or the program requires (assumes) that the text has a particular encoding.
Similarly, if a computer program must send (output) text to another program or a display device, it must either tell the destination the character encoding used or the program must use the encoding that the destination expects.
In practice, almost all problems with character encodings are caused when a destination expects text sent using one character encoding, and the text is actually sent with a different character encoding. That in turn is typically caused by the computer programmer not bearing in mind that there exist many possible character encodings, and that their program can not treat encoded text as opaque values, but must convert from an external representation on input and convert to an external representation on output.
In Python (either 2 or 3), evaluating b'\xe2\x80\x8f'.decode("utf-8")
yields \u200f, and similarly '\u200f'.encode("utf-8") yields b'\xe2\x80\x8f'.
The first looks like a chain of three 2-character hex values that would equal decimal 226, 128, and 143. The second looks like a single hex value that would equal decimal 8,207.
Is there a logical relationship between '\xe2\x80\x8f' and '\u200f' ? Am I interpreting the values incorrectly?
I can see the values are linked somehow in tables like this one: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string-literal
but why are these two values on the same row?
The difference is related to the amount of bits/bytes that each character takes up to represent in utf-8.
For any character equal to or below 127 (hex 0x7F), the UTF-8
representation is one byte. It is just the lowest 7 bits of the full
unicode value. This is also the same as the ASCII value.
For characters equal to or below 2047 (hex 0x07FF), the UTF-8
representation is spread across two bytes. The first byte will have
the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The
second byte will have the top bit set and the second bit clear (i.e.
0x80 to 0xBF).
There is more information about this here.
If you wanted more info on how Python uses these values, check out here.
Yes, the first is "a chain of three 2-character hex values that would equal decimal 226, 128, and 143." It's a byte string. You got a byte string because that's what encode does. You passed it UTF-8 so the bytes are the UTF-8 encoding for the input character string.
"The second looks like a single hex value that would equal decimal 8,207." Sort of; It's the notation for a UTF-16 code unit inside a literal character string. One or two UTF-16 code units encode a Unicode codepoint. In this case, only one is used for the corresponding codepoint.
Sure, you can convert the hex to decimal but that's not very common or useful in either case. A code unit is a specific bit pattern. Bytes are that bit pattern as an integer, serialized to a byte sequence.
The Unicode codepoint range needs 21 bits. UTF-16 encodes a codepoint in one or two 16-bit code units (so that's two bytes in some byte order for each code unit). UTF-8 encodes a codepoint in one, two, three or four 8-bit code units. (An 8-bit integer is one byte so byte order is moot.) Each character encoding has a separate algorithm to distribute the 21 bits into as many bytes are needed. Both are reversible and fully support the Unicode character set. So, you could directly convert one to the other.
The table you reference doesn't show UTF-16. It shows Unicode codepoint hex notation: U+200F. That notation is for humans to identify codepoints. It so happens that when UTF-16 encodes a codepoint in one code unit, it's number is the same as the codepoint's number.
I've been reading about base64 conversion, and what I understand is that the encoded version of the original data will be 133% of the original size.
Then, I'm reading about how YouTube is able to have unique identifiers to their videos like FJZQSHn7fc and the reason was: an 11 character base64 string can map to a huge number.
Wait, say a huge number contains 20 characters, then wouldn't a base64 encoded string be 133% of that size, not shorter?
I'm very confused. Are there different types of base64 conversion (string to base64 vs. decimal to base64), once resulting in a bigger, and the other in a smaller resulting string?
Each character in base 64 can encode 6 bits of data. Thus 11 characters can encode 6x11 = 66 bits of data.
2^66 = 73786976294838206464
73786976294838206464 (approximately 7.4 x 10^19 or 74 quintillion) possible identifiers is more than enough to distinguish unique YouTube videos for the foreseeable future.
It is unlikely that YouTube is using these strings of length 11 as encodings of smaller objects. You can use base64 (just a number in base 64 after all) without having to think of it as an encoding of something else, just like you can use bytes (binary numbers with 8 bits) without thinking of those bytes as being encodings of ascii characters. The only important question with an identifier scheme is if there are enough identifiers to go around. In this case there clearly are.
Think of it like this: you have a 64bit number (called long in Java, for example).
Now, you can print that number in different ways:
As a binary number (base 2), printing 64 '0' or '1'
As a decimal number (base 10), printing up to 20 decimal digits
As a hexadecimal number (base 16), printing 16 hexadeciaml digits
As a number in base 64, printing 11 "digits" in that base. You can use any graphical symbols as digits.
... you understand by now that there are many more possibilities ...
It seems like they use the same base-64 numbers as the ones that are used in base64 encoding, that is, uppercase and lowercase letters, ordinary digits and 2 extra chars. Each character represents a 6-bit value. So you get 66 bits, and depending on the algorithm used, either the leading or trailing 2 bits are cut off to get a nice long value back.
You are confusing what things are being compared.
There are 2 statements, both comparing different things:
"base64 encoding is 133% bigger than original size"
"An 11 character base64 string can encode a huge number"
In the case of 1, they are normally referring to a string encoded maybe with ASCII using 8bits a character, and comparing that with the same string encoded in base64. That is 133% bigger, because in base64 you can't use all 255 bit combinations in every byte.
In the case of 2, they are comparing using a numeric identifier, and then either encoding it as base64, or base10. In this case, base64 is a lot shorter than base10.
You can also think of the (1) case as comparing base256 against base64, and the (2) case as comparing base10 against base64.
When you say Base64, some would think of RFC 4648. If YouTube is using RFC 4648, then it's a 12-digit number where they're omitting the last digit because it is always '=', the padding character (the 65th element of the base64 alphabet). The 12 digits represent three blocks of four digits, and four digits yield 24 bits of information. YouTube video IDs would therefore be 64-bit, not 66-bit, if they're using the standard.
Those 64 bits might be representing an unsigned integer. YouTube used MySQL and then sharded MySQL through Vitess, so you could imagine them using an UNSIGNED BIGINT key internally that they encode via RFC 4648-compliant Base64 externally.
Clearly Tom Scott thinks YouTube is squeezing 66 bits out of their 11 characters; his video says so.
If he's wrong, then their frontend might allow you to specify four distinct video IDs for the same video. Those two extra bits' values do not affect the UNSIGNED BIGINT. Which two bits they are depend on endianness and other choices of encoding.
Regardless of whether YouTube is using standard or nonstandard encoding, they can represent 18446744073709551615 in 11 characters (since the padding character is always there and and thus omitted for a 64-bit quantity).
Perhaps they use something like the following to compute a pseudorandom 64-bit integer when a new video is created:
import base64
import random
def Base64RandomSlug():
array = bytearray(random.getrandbits(8) for x in range(64 // 8))
b = base64.urlsafe_b64encode(bytes(array))
return b.decode('utf-8').rstrip('=')