The fundamental method to convert hex to base64 in Python 3

I want to convert a given hex string into base64 (in Python, without using any libraries). As I learned from other Stack Overflow answers, we can either group 3 hex digits (12 bits, i.e. 4 bits each) to get 2 base64 values (12 bits, i.e. 6 bits each), or group 6 hex digits (24 bits) into 4 base64 values (24 bits).
The standard procedure is to append the binary bits of all the hex digits together and start grouping from the left in packets of 6.
My question is regarding the situation we need padding for:
(Assuming we are converting 3 hex into 2 base64)
There will arise a situation when we are left with only 2 or 1 hex values to convert. Take the example below:
'a1' to base64
10100001 (binary of a1)
101000 01(0000) //making groups of 6 and adding additional 0's where required
This gives "oQ", but one source gives the answer as oQ== while another gives something different (wqE=).
Q1. Which of the two sources is giving the correct answer? Why is the other one wrong, despite being a good online decoder?
Q2. How do we determine the number of '=' characters here? (We could have just added sufficient 0's wherever needed, as in the example above, ending up with just oQ and not oQ==, assuming oQ== is correct.)
My understanding is that if the hex is of length 2 (rather than 3) we pad with a single = (consistent with the answer wqE= in the above case), and if the hex is of length 1 (rather than 3) we pad with two ='s.
At the same time, I am confused because, if 3 hex digits are converted into 2 base64 values, we would never need two ='s.
'a' to base64
1010 (binary of a)
Q3. How do we convert hex 'a' to base64?

Base64 is defined by RFC 4648 as being "designed to represent arbitrary sequences of octets". An octet is a unit of 8 bits, in practice synonymous with byte. When your input is in the form of a hex string, your first step should be to decode it into a byte string. You need two hex characters for each byte. If the length of the input is odd, the reasonable course of action is to raise an error.
To address your numbered questions:
Q1: Even if you are going to implement your own encoder, you can use the Python standard library to investigate. Decoding the two results back to bytes gives:
>>> import base64
>>> base64.b64decode(b'oQ==')
b'\xa1'
>>> base64.b64decode(b'wqE=')
b'\xc2\xa1'
So, oQ== is correct, while wqE= has a c2 byte added in front. I can guess that it is the result of applying UTF-8 encoding before Base64. To confirm:
>>> '\u00a1'.encode('utf-8')
b'\xc2\xa1'
Q2: The rules for padding are detailed in the RFC.
Q3: This is ambiguous and you are right to be confused.
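The byte-oriented procedure described above can be sketched without any imports. This is only an illustration under stated assumptions: the function name hex_to_base64 and the constant B64_ALPHABET are made up for this example, and the padding follows the RFC 4648 rule that every group of 3 input bytes becomes 4 output symbols, with one '=' per missing byte in the final group.

```python
# Standard Base64 alphabet from RFC 4648 (index 0..63).
B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)


def hex_to_base64(hex_string):
    """Hypothetical helper: hex digits -> bytes -> Base64 text."""
    if len(hex_string) % 2 != 0:
        # One hex digit is only half a byte, so the input is ambiguous.
        raise ValueError("hex input must contain an even number of digits")

    # Step 1: decode pairs of hex digits into byte values (0..255).
    data = [int(hex_string[i:i + 2], 16) for i in range(0, len(hex_string), 2)]

    out = []
    # Step 2: process the bytes in groups of three (24 bits).
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)  # 0, 1 or 2 bytes short in the last group

        # Pack the chunk into a 24-bit integer, zero-filling missing bytes.
        n = 0
        for byte in chunk:
            n = (n << 8) | byte
        n <<= 8 * missing

        # Split the 24 bits into four 6-bit values.
        symbols = [B64_ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]

        # Symbols that contain only filler bits are replaced by '='.
        symbols = symbols[:4 - missing] + ["="] * missing
        out.append("".join(symbols))

    return "".join(out)


print(hex_to_base64("a1"))    # one byte  -> two '=' characters
print(hex_to_base64("c2a1"))  # two bytes -> one '=' character
```

Counting in bytes rather than hex digits is what makes the number of '=' characters unambiguous: one leftover byte gives two pads, two leftover bytes give one pad.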

Related

Python3 counting UTF-16 code points in a string

I am trying to figure out how to either convert UTF-16 offsets to UTF-8 offsets, or somehow be able to count the # of UTF-16 code points in a string. (I think in order to do the former, you have to do the latter anyways.)
Sanity check: am I correct that the len() function, when applied to a Python string, returns the number of code points in it in UTF-8?
I need to do this because the LSP protocol requires the offsets to be in UTF-16, and I am trying to build something with LSP in mind.
I can't seem to find how to do this, the only python LSP server I know of doesn't even handle this conversion itself.

Python has two datatypes which can be used for characters, neither of which natively represents UTF-16 code units.
In Python-3, strings are represented as str objects, which are conceptually vectors of unicode codepoints. So the length of a str is the number of Unicode characters it contains, and len("𐐀") is 1, just as with any other single character. That's independent of the fact that "𐐀" requires two UTF-16 code units (or four UTF-8 code units).
Python-3 also has a bytes object, which is a vector of bytes (as its name suggests). You can encode a str into a sequence of bytes using the encode method, specifying some encoding. So if you want to produce the stream of bytes representing the character "𐐀" in UTF-16LE, you would invoke "𐐀".encode('utf-16-le').
Specifying le (for little-endian) is important because encode produces a stream of bytes, not UTF-16 code units, and each code unit requires two bytes since it's a 16-bit number. If you don't specify a byte order, as in encode('utf-16'), you'll find a two-byte UTF-16 Byte Order Mark at the beginning of the encoded stream.
Since the UTF-16 encoding requires exactly two bytes for each UTF-16 code unit, you can get the UTF-16 length of a unicode string by dividing the length of the encoded bytes object by two: len(s.encode('utf-16-le'))//2.
But that's a pretty clunky way to convert between UTF-16 offsets and character indexes. Instead, you can just use the fact that characters representable with a single UTF-16 code unit are precisely the characters with codepoints less than 65536 (2^16):
def utf16len(c):
    """Returns the length of the single character 'c'
    in UTF-16 code units."""
    return 1 if ord(c) < 65536 else 2
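Building on the same codepoint test, a conversion between UTF-16 offsets and str indexes might look like the sketch below. The helper names utf16_length and utf16_offset_to_index are hypothetical, and the choice to round an offset that lands inside a surrogate pair up to the next character is an assumption, not something any LSP library is known to mandate.

```python
def utf16_length(s):
    """Number of UTF-16 code units needed for the whole string."""
    return sum(1 if ord(ch) < 0x10000 else 2 for ch in s)


def utf16_offset_to_index(s, utf16_offset):
    """Map an offset counted in UTF-16 code units to a str index.

    An offset that falls in the middle of a surrogate pair is rounded
    up to the index of the following character (an arbitrary choice).
    """
    units = 0
    for index, ch in enumerate(s):
        if units >= utf16_offset:
            return index
        units += 1 if ord(ch) < 0x10000 else 2
    return len(s)


text = "a\U00010400b"  # 'a', DESERET CAPITAL LETTER LONG I, 'b'
print(len(text))                       # code points
print(utf16_length(text))              # UTF-16 code units
print(utf16_offset_to_index(text, 3))  # str index of 'b'
```

The result of utf16_length can be cross-checked against len(text.encode('utf-16-le')) // 2, which is the slower route through an encoded bytes object.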

For counting the bytes, including BOM, len(str.encode("utf-16")) would work. You can use utf-16-le for bytes without BOM.
Example:
>>> len("abcd".encode("utf-16"))
10
>>> len("abcd".encode("utf-16-le"))
8
As for your question: No, len(str) in Python counts the number of decoded characters (code points). If a character takes 4 UTF-8 code units (bytes), it still counts as 1.

How can I extract mixed binary and ascii values from a bytes string like I did in 2.x?

The following represents a binary image extracted from a file (spaces inserted between bytes to make reading easier). File is opened with 'rb' mode.
01 77 33 9F 41 42 43 44 00 11 11 11
In Python 2.7, I read it as a character string and I use ord() to extract the binary values and then I can extract or even search the string for a specific text value (such as the "ABCD" in characters 4-7). The binary bytes can be anything from 0-FF. I've been putting off conversion to python 3 partly because of this.
I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
I don't write the file, I just use it, so changing it is not an option.
I've seen lots of examples of using b' and other things to convert fixed strings but I need a way to intermix these values, extracting bytes, 2-byte to 8-byte values as 16-bit to 64-bit words, and extracting/searching for ASCII strings within the larger string.
The byte/character separation in Python 3 seems somewhat inflexible for what I need. I'm sure there's a way to do this I just haven't found an example or an answered question that seems to cover this case.
This is a simplified example, I can't provide real data (it's proprietary) but this illustrates the problem. The real files may be short (<1K) or huge (>100K), containing multiple records of different sizes.
Is there an easy, straightforward way to essentially replicate the functionality I have in Python 2.7?
This is on Windows.
Thanks

I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
Read the file in binary mode, as you are doing. This produces a bytes object, which in 3.x is not the same as a str (as it would be in 2.x).
Interpret the bytes as bytes, as needed, to figure out the general structure of the data. Slicing the bytes produces another bytes as before; indexing produces an int with the numeric value of that single byte (not as before) - no ord required.
When you have determined a subset of the bytes that represent a string (let's say for convenience that you have sliced it out), convert to string using the appropriate encoding: e.g. str(my_bytes, 'ascii'). Note that ASCII will not handle byte values 0x80 through 0xFF; especially with binary-ish legacy file formats, there's a good chance your data is actually something like Latin-1: str(my_bytes, 'iso-8859-1').
search the string for a specific text value
You can search at either the text or the byte level - bytes objects support the in operator, searching for either a subsequence of bytes or a single integer value. Whether it makes more sense to search before or after string conversion will depend on what you are doing.
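As a rough illustration of the steps so far, here is a sketch using the sample bytes from the question. The variable names, and the reading of byte 2 as a record length, are assumptions taken from the question's own example rather than any real file format.

```python
# The twelve sample bytes from the question, as a bytes object.
data = bytes.fromhex("01 77 33 9F 41 42 43 44 00 11 11 11")

# Indexing gives an int directly, so no ord() is needed.
record_length = data[2]

# Slicing gives another bytes object.
text_bytes = data[4:8]

# Decode only the slice that is known to hold text.
text = text_bytes.decode("ascii")

# Searching works at the byte level, before any decoding.
position = data.find(b"ABCD")   # index of the first match, or -1
has_text = b"ABCD" in data      # subsequence test
has_byte = 0x9F in data         # single byte value test

print(record_length, text, position, has_text, has_byte)
```

If the text might contain byte values above 0x7F, data[4:8].decode("iso-8859-1") never fails, because Latin-1 maps every byte value to a character.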
using b' and other things to convert fixed strings
b'' is just the syntax for a literal bytes object. It's what you'll see if you ask for the repr of what you read from the file. Prefixing a b onto an existing string literal in your code isn't really "converting" anything, but replacing it with the value you should have had in the first place.
2-byte to 8-byte values as 16-bit to 64-bit words
The documentation says it at least as well as I could:
>>> help(int.from_bytes)
Help on built-in function from_bytes:
from_bytes(...) method of builtins.type instance
int.from_bytes(bytes, byteorder, *, signed=False) -> int
Return the integer represented by the given array of bytes.
The bytes argument must be a bytes-like object (e.g. bytes or bytearray).
The byteorder argument determines the byte order used to represent the
integer. If byteorder is 'big', the most significant byte is at the
beginning of the byte array. If byteorder is 'little', the most
significant byte is at the end of the byte array. To request the native
byte order of the host system, use `sys.byteorder' as the byte order value.
The signed keyword-only argument indicates whether two's complement is
used to represent the integer.
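A short sketch of multi-byte extraction on the same sample bytes follows. The offsets and byte orders are chosen purely for illustration, since the question does not say which the real format uses. The struct module from the standard library is shown as an alternative that the answer itself does not mention.

```python
import struct

data = bytes.fromhex("0177339F4142434400111111")

# Interpret bytes 0..1 as a 16-bit unsigned integer, in both byte orders.
word16_big = int.from_bytes(data[0:2], byteorder="big")        # 0x0177
word16_little = int.from_bytes(data[0:2], byteorder="little")  # 0x7701

# Interpret bytes 8..11 as a 32-bit big-endian unsigned integer.
word32_big = int.from_bytes(data[8:12], byteorder="big")       # 0x00111111

# struct can unpack several fixed-size fields in one call:
# "<HH" means little-endian, two unsigned 16-bit values.
first, second = struct.unpack_from("<HH", data, 0)

print(hex(word16_big), hex(word16_little), hex(word32_big))
print(hex(first), hex(second))
```

int.from_bytes is convenient for a single field of any width, while struct tends to read better when a record has several fixed-size fields in a row.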

Python Byte doesn't print binary

When I print a program such as this in Python:
x = b'francis'
The output is b'francis'. If bytes is in 0's and 1's why is it not printing it out?

You seem to be fundamentally confused, in a very common way. The data itself is a distinct concept from its representation, i.e. what you see when you attempt to print it out or otherwise display it. There may be multiple ways to represent the same data. This is just like how if I write 23 (in decimal) or 0x17 (hexadecimal) or 0o27 (octal) or 0b10111 (binary) or twenty-three (English), I am talking about the same number.
At some lower level below Python, everything is bytes, and each byte consists of bits; but it is not correct to say that the bytes "are in" 0s and 1s - just like how it is not correct to say that the number twenty-three "is in" decimal digits (or hexadecimal, octal or binary ones, or in English text characters).
The symbols 0 and 1 are just pictures that we draw on a screen to represent the state of those bits - if we choose to represent them individually. Sometimes, we choose larger groupings, and assign different symbols to various combinations of states. For example, we may interpret multiple bits as a single integer value in binary; or (using Unicode) we might further interpret that number as a "code point" (most of these are text characters; some are control characters, or portions of text characters).
A Python bytes object is a wrapper for a "raw" sequence of bytes. When you display it, Python uses a representation where each byte (grouping of 8 bits) corresponds to one or more symbols: bytes whose corresponding integer value is between thirty-two and one hundred twenty-six (inclusive) are (for historical reasons) represented using individual text characters (following the so-called ASCII encoding), while others are represented with a four-character "escape sequence" beginning with \x and followed by the hexadecimal representation of the number.
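To make the distinction between data and representation concrete, the same bytes object can be displayed in several ways. This is only a small sketch and the variable names are arbitrary.

```python
x = b'francis'

# Default display: printable ASCII bytes are shown as characters.
print(x)

# The same data as a list of integers.
as_integers = list(x)
print(as_integers)

# The same data as groups of eight binary digits.
as_bits = " ".join(format(byte, "08b") for byte in x)
print(as_bits)

# Bytes outside the printable ASCII range appear as \x escapes.
mixed = bytes([0x66, 0x00, 0xFF, 0x7E])
print(mixed)
```

None of these calls changes x; each one only chooses a different picture of the same seven bytes.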

From python docs:
bytes and bytearray objects are sequences of integers (between 0 and
255), representing the ASCII value of single bytes.
So they are sequences of integers which represent ASCII values.
For conversion you can use:
import sys
int.from_bytes(b'\x11', byteorder=sys.byteorder) # => 17
bin(int.from_bytes(b'\x11', byteorder=sys.byteorder)) # => '0b10001'

The bytes object was intentionally designed to work like this: the repr uses the corresponding ASCII characters for bytes in the printable ASCII range, well-known backslash escapes for a few special ASCII control characters, and hex backslash escapes for everything else (and the str just is the repr).
The basic idea is that bytes can be used as an immutable array of integers from 0-255, but more often it's used as an immutable array of characters encoded in some ASCII-compatible charset.
In particular, one of the most common uses of bytes is for things like the headers in HTTP, SMTP, and other network protocols. These headers are generally entirely in pure ASCII, or at least pure ASCII keys with some values in pure ASCII and others in an ASCII-compatible charset—and you generally have to parse the ASCII headers to figure out what charset to use to decode the body. Being able to see those headers are ASCII characters is a lot more useful than just seeing them as a sequence of numbers.

Basically, everything on your computer is eventually represented by 0's and 1's.
The purpose of b-notation isn't as you expected it to be.
I would like to refer you to a great answer that might help you understand what the b-notation is for and how to use it properly:
What does the 'b' character do in front of a string literal?
Good luck.

Python 3.5 base64 decoding seems to be incorrect?

In Python 3.5 the base64 module has a method, standard_b64decode() for decoding strings from base64, which returns a bytes object.
When I run base64.standard_b64decode("wc==") the output is b'\xc1'. When you base64 encode "\xc1", you get "wQ==". It looks like there is an error in the decoding function. Actually, I think "wc==" is an invalid base64 encoded string, by this reasoning:
wc== ends with ==, which means that it was produced from a single input byte.
The corresponding values of 'w' and 'c' in the regular base64 alphabet are, respectively, 48 and 28, meaning their 6-bit representations are, respectively, 110000 and 011100.
Concatenating these, the first 8 bits are 11000001, which is \xc1, but the remaining bits (1100) are non-zero, so couldn't have been produced by the padding process performed during base64 encoding, as that only appends bits with value 0, which means these extra 1 bits can't have been produced through valid base64 encoding -> the string is not a valid base64 encoded string.
I think this is true for any 4 character chunk of base64 encoding ending in == when any of the last 4 bits of the second character are 1.
I'm pretty convinced that this is right, but I'm rather less experienced than the Python developers.
Can anyone confirm the above, or explain why it's wrong, if indeed it is?

The Base64 standard is defined by RFC 4648. Your question is answered by §3.5:
Canonical Encoding
The padding step in base 64 and base 32 encoding can, if improperly implemented, lead to non-significant alterations of the encoded data. For example, if the input is only one octet for a base 64 encoding, then all six bits of the first symbol are used, but only the first two bits of the next symbol are used. These pad bits MUST be set to zero by conforming encoders, which is described in the descriptions on padding below. If this property do not hold, there is no canonical representation of base-encoded data, and multiple base-encoded strings can be decoded to the same binary data. If this property (and others discussed in this document) holds, a canonical encoding is guaranteed.
In some environments, the alteration is critical and therefore decoders MAY chose to reject an encoding if the pad bits have not been set to zero.
The meaning of MAY is defined by RFC 2119:
MAY This word, or the adjective "OPTIONAL", mean that an item is truly optional. One vendor may choose to include the item because a particular marketplace requires it or because the vendor feels that it enhances the product while another vendor may omit the same item.
So Python is not obliged by the standard to reject non-canonical encodings.
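If an application does want to reject non-canonical input, one possible approach is to decode and then re-encode, comparing the result with the original text. The function name is_canonical_base64 is made up for this sketch. It assumes the standard alphabet with padding, and it is not a feature of the base64 module itself.

```python
import base64
import binascii


def is_canonical_base64(text):
    """Return True only if 'text' is exactly what an encoder would emit."""
    try:
        decoded = base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        # Not decodable at all (bad characters, bad padding, ...).
        return False
    # A conforming encoder sets the unused pad bits to zero, so
    # re-encoding reproduces the input only when it was canonical.
    return base64.b64encode(decoded).decode("ascii") == text


print(is_canonical_base64("wQ=="))  # canonical encoding of b'\xc1'
print(is_canonical_base64("wc=="))  # same byte, non-zero pad bits
```

The round trip costs an extra encode, but it avoids reimplementing the bit-level check by hand.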

Base64 conversion decimals

I've been reading about base64 conversion, and what I understand is that the encoded version of the original data will be 133% of the original size.
Then, I'm reading about how YouTube is able to have unique identifiers to their videos like FJZQSHn7fc and the reason was: an 11 character base64 string can map to a huge number.
Wait, say a huge number contains 20 characters, then wouldn't a base64 encoded string be 133% of that size, not shorter?
I'm very confused. Are there different types of base64 conversion (string to base64 vs. decimal to base64), one resulting in a bigger string and the other in a smaller one?

Each character in base 64 can encode 6 bits of data. Thus 11 characters can encode 6x11 = 66 bits of data.
2^66 = 73786976294838206464
73786976294838206464 (approximately 7.4 x 10^19 or 74 quintillion) possible identifiers is more than enough to distinguish unique YouTube videos for the foreseeable future.
It is unlikely that YouTube is using these strings of length 11 as encodings of smaller objects. You can use base64 (just a number in base 64 after all) without having to think of it as an encoding of something else, just like you can use bytes (binary numbers with 8 bits) without thinking of those bytes as being encodings of ascii characters. The only important question with an identifier scheme is if there are enough identifiers to go around. In this case there clearly are.

Think of it like this: you have a 64bit number (called long in Java, for example).
Now, you can print that number in different ways:
As a binary number (base 2), printing 64 '0' or '1'
As a decimal number (base 10), printing up to 20 decimal digits
As a hexadecimal number (base 16), printing 16 hexadecimal digits
As a number in base 64, printing 11 "digits" in that base. You can use any graphical symbols as digits.
... you understand by now that there are many more possibilities ...
It seems like they use the same base-64 numbers as the ones that are used in base64 encoding, that is, uppercase and lowercase letters, ordinary digits and 2 extra chars. Each character represents a 6-bit value. So you get 66 bits, and depending on the algorithm used, either the leading or trailing 2 bits are cut off to get a nice long value back.
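Printing a number in base 64 can be sketched in a few lines. The URL-safe alphabet used here (with '-' and '_' as the two extra characters) and the function names are assumptions for illustration. Nothing in this thread confirms how YouTube actually generates its identifiers.

```python
# Assumed digit symbols: the URL-safe Base64 alphabet from RFC 4648.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-_"
)


def int_to_base64_digits(n):
    """Write a non-negative integer using base-64 positional notation."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, 64)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def base64_digits_to_int(text):
    """Inverse of int_to_base64_digits."""
    n = 0
    for ch in text:
        n = n * 64 + ALPHABET.index(ch)
    return n


largest = 2**64 - 1
print(len(str(largest)))                   # decimal digits
print(len(int_to_base64_digits(largest)))  # base-64 digits
```

This is ordinary positional notation, not the byte-oriented Base64 encoding, which is why the result is shorter than the decimal form rather than longer.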

You are confusing what things are being compared.
There are 2 statements, both comparing different things:
"base64 encoding is 133% bigger than original size"
"An 11 character base64 string can encode a huge number"
In the case of 1, they are normally referring to a string encoded with, say, ASCII using 8 bits per character, and comparing that with the same string encoded in base64. The base64 version is about 133% of the original size, because base64 does not use all 256 possible values of each byte; each output character carries only 6 bits.
In the case of 2, they are comparing using a numeric identifier, and then either encoding it as base64, or base10. In this case, base64 is a lot shorter than base10.
You can also think of the (1) case as comparing base256 against base64, and the (2) case as comparing base10 against base64.
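Both comparisons can be checked numerically. The sketch below uses arbitrary sample data and a 64-bit value as the "huge number"; both are assumptions chosen only to make the arithmetic easy to follow.

```python
import base64
import math

# Case 1: arbitrary bytes (base 256) versus their Base64 text.
raw = bytes(range(256)) * 3          # 768 bytes of sample data
encoded = base64.b64encode(raw)      # every 3 bytes become 4 characters
ratio = len(encoded) / len(raw)
print(len(raw), len(encoded), ratio)

# Case 2: digits needed to write any 64-bit number in base 10 and base 64.
bits = 64
decimal_digits = math.ceil(bits / math.log2(10))  # each digit holds about 3.32 bits
base64_digits = math.ceil(bits / 6)               # each digit holds 6 bits
print(decimal_digits, base64_digits)
```

The first comparison grows by a third because 8-bit units are rewritten as 6-bit units. The second shrinks because 6-bit digits replace decimal digits that carry only about 3.32 bits each.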

When you say Base64, some would think of RFC 4648. If YouTube is using RFC 4648, then it's a 12-digit number where they're omitting the last digit because it is always '=', the padding character (the 65th element of the base64 alphabet). The 12 digits represent three blocks of four digits, and four digits yield 24 bits of information. YouTube video IDs would therefore be 64-bit, not 66-bit, if they're using the standard.
Those 64 bits might be representing an unsigned integer. YouTube used MySQL and then sharded MySQL through Vitess, so you could imagine them using an UNSIGNED BIGINT key internally that they encode via RFC 4648-compliant Base64 externally.
Clearly Tom Scott thinks YouTube is squeezing 66 bits out of their 11 characters; his video says so.
If he's wrong, then their frontend might allow you to specify four distinct video IDs for the same video. Those two extra bits' values do not affect the UNSIGNED BIGINT. Which two bits they are depend on endianness and other choices of encoding.
Regardless of whether YouTube is using standard or nonstandard encoding, they can represent 18446744073709551615 in 11 characters (since the padding character is always there and thus omitted for a 64-bit quantity).
Perhaps they use something like the following to compute a pseudorandom 64-bit integer when a new video is created:
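import base64
import random

def Base64RandomSlug():
    # Build 8 random bytes (64 bits), one byte at a time.
    array = bytearray(random.getrandbits(8) for x in range(64 // 8))
    # URL-safe Base64 of 8 bytes is 11 symbols plus one '=' pad, which is stripped.
    b = base64.urlsafe_b64encode(bytes(array))
    return b.decode('utf-8').rstrip('=')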
