base64 encode & decode questions where a != b

Given this example in Python
sample = '5PB37L2CH5DUDWN2SUOYE6LJPYCJBFM5N2FGVEHF7HD224UR52KB===='
a = base64.b32decode(sample)
b = base64.b32encode(a)
why is it that
sample != b ?
BUT where
sample = '5PB37L2CH5DUDWN2SUOYE6LJPYCJBFM5N2FGVEHF7HD224UR52KBAAAA'
then
sample == b

the first sample you got there is invalid base64.
taken from wiki:
When the number of bytes to encode is not divisible by 3 (that is, if there are only one or two bytes of input for the last block), then the following action is performed: Add extra bytes with value zero so there are three bytes, and perform the conversion to base64. If there was only one significant input byte, only the first two base64 digits are picked, and if there were two significant input bytes, the first three base64 digits are picked. '=' characters might be added to make the last block contain four base64 characters.
http://en.wikipedia.org/wiki/Base64#Examples
edit:
taken from RFC 4648:
Special processing is performed if fewer than 24 bits are available
at the end of the data being encoded. A full encoding quantum is
always completed at the end of a quantity. When fewer than 24 input
bits are available in an input group, bits with value zero are added
(on the right) to form an integral number of 6-bit groups. Padding
at the end of the data is performed using the '=' character.
4 times 8 bits (the ='s at the end of your sample) is more than 24 bits, so they are at the least unnecessary. (Not sure what datatype sample is, but find out and take its size times the number of characters divided by 24.)
about your particular sample:
base-encoding reads in 24bit chunks and only needs '=' padding characters at the end of the base'd string to make whatever was left of the string after splitting it into 24bit chunks be "of size 24" so it can be parsed by the decoder.
since the ===='s at the end of your string amount to more than 24bits they are useless, hence: invalid...

First, let's be clear: your question is about base32, not base64.
Your original sample is a bit too long. There are four = padding characters at the end, meaning at least 20 bits of padding. The number of data bits must be a multiple of 8, so it's really 24 bits: the 20 bits of = characters plus the low four bits of the last character, B. The encoding for B in base32 is 1 (binary 00001), which means one of those padding bits is set. This is a violation of the spec, which says all the padding bits must be clear. The decode drops the bit completely, and the encode produces the proper value A instead of B.
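The stray bit can be checked directly. This is a small sketch, assuming Python 3, where base64.b32decode silently discards non-zero padding bits (the behaviour the question observed); the names BASE32_ALPHABET and stray_bits are made up for illustration.

```python
import base64

BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"

sample = "5PB37L2CH5DUDWN2SUOYE6LJPYCJBFM5N2FGVEHF7HD224UR52KB===="

# With four '=' characters the last group holds 4 data characters = 20 bits.
# Only 16 of those bits (2 bytes) are data, so the low 4 bits of the last
# character are padding and should be zero.
last_char = sample.rstrip("=")[-1]
stray_bits = BASE32_ALPHABET.index(last_char) & 0b1111

decoded = base64.b32decode(sample)                     # stray bit is dropped here
reencoded = base64.b32encode(decoded).decode("ascii")  # canonical form

print(last_char, stray_bits)   # B 1
print(len(decoded))            # 32
print(reencoded[-8:])          # 52KA====
```

The re-encoded string differs from the sample only in that final character, which is why sample != b.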

Related

Convert compression algorithm to decompression

Sorry for the simple question, but it's blowing my mind, since I'm not good at data structures.
First, I have an initial binary file with compressed raw data. My colleague helped me turn the bytes into an array of decimal numbers in Python (the code is given below and works just fine, showing the result as a chart in pyplot).
Now I want to do the reverse operation, i.e. turn an array of decimal numbers into a binary file, but I'm totally stuck. Thank you very much in advance!
data_out = []
# decode 1st point
data_out.append(int.from_bytes(data_in[0:4], byteorder='big', signed=True))
i = 4
while i < len(data_in):
    # get next byte
    curr = int.from_bytes(data_in[i:i+1], byteorder='big', signed=False)
    if curr < 255:
        res = curr - 127
        data_out.append(res + data_out[-1])
        i = i + 1
    else:
        res = int.from_bytes(data_in[i+1:i+5], byteorder='little', signed=True)
        data_out.append(res)
        i = i + 5
from matplotlib import pyplot as plt
plt.plot(data_out)
plt.show()

The original stream of bytes was encoded as one or four-byte integers. The first value is sent as a four-byte integer. After the first value, you have either one byte in the range 0..254, which represents a difference of -127 to 127, or you have 255 followed by four-byte signed little-endian integer, which is the next value (not a difference). The idea is that if the integers are changing slowly from one to the next, this will compress the sequence by up to a factor of four by sending small differences as one byte instead of four. Though if you have too many differences that don't fit in a byte, this could expand the data by 25%, since non-difference values take five bytes instead of four.
To encode such a stream, you start by encoding the first value directly as four bytes, big-endian (the decoder in the question reads the first value with byteorder='big'). For each subsequent value, you subtract the previous value from this one. If the result is in the range -127 to 127, then add 127 and send that byte. Otherwise send a 255 byte, followed by the value (not the difference) as a four-byte signed little-endian integer.
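The encoding steps can be sketched as follows. This is an illustration rather than a tested production encoder: the function names encode and decode are made up, the first value is written big-endian because the decoder in the question reads it with byteorder='big', and all values are assumed to fit in a signed 32-bit integer.

```python
def encode(values):
    """Encode a list of ints into the delta format described above."""
    out = bytearray()
    if not values:
        return bytes(out)
    # First value: four bytes, signed, big-endian (matches the decoder).
    out += values[0].to_bytes(4, byteorder='big', signed=True)
    prev = values[0]
    for value in values[1:]:
        diff = value - prev
        if -127 <= diff <= 127:
            out.append(diff + 127)   # one byte in the range 0..254
        else:
            out.append(255)          # escape marker
            out += value.to_bytes(4, byteorder='little', signed=True)
        prev = value
    return bytes(out)


def decode(data_in):
    """Same logic as the decoder in the question, used for a round-trip check."""
    data_out = [int.from_bytes(data_in[0:4], byteorder='big', signed=True)]
    i = 4
    while i < len(data_in):
        curr = data_in[i]
        if curr < 255:
            data_out.append(curr - 127 + data_out[-1])
            i += 1
        else:
            data_out.append(int.from_bytes(data_in[i+1:i+5], byteorder='little', signed=True))
            i += 5
    return data_out


values = [1000, 1001, 900, 100000, 99990, -5]
packed = encode(values)
print(len(packed))               # 17
print(decode(packed) == values)  # True
```

To produce a file, write the result with open(path, 'wb').write(encode(values)), where path is whatever output filename you choose.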
As pointed out by @greybeard, there is an error in your colleague's code (assuming it was copied here correctly), in that res is not initialized. The first point decoding needs to be:
# decode 1st point
res = int.from_bytes(data_in[0:4], byteorder='big', signed=True)
data_out.append(res)

Pack into c types and obtain the binary value back

I'm using the following code to pack an integer into an unsigned short as follows,
raw_data = 40
# Pack into little endian
data_packed = struct.pack('<H', raw_data)
Now I'm trying to unpack the result as follows. I use utf-16-le since the data is encoded as little-endian.
def get_bin_str(data):
    bin_asc = binascii.hexlify(data)
    result = bin(int(bin_asc.decode("utf-16-le"), 16))
    trimmed_res = result[2:]
    return trimmed_res
print(get_bin_str(data_packed))
Unfortunately, it throws the following error,
result = bin(int(bin_asc.decode("utf-16-le"), 16))
ValueError: invalid literal for int() with base 16: '㠲〰'
How do I properly decode the little-endian bytes into binary data?

Use unpack to reverse what you packed. The data isn't UTF-encoded so there is no reason to use UTF encodings.
>>> import struct
>>> data_packed = struct.pack('<H', 40)
>>> data_packed.hex() # the two little-endian bytes are 0x28 (40) and 0x00 (0)
'2800'
>>> data = struct.unpack('<H',data_packed)
>>> data
(40,)
unpack returns a tuple, so index it to get the single value
>>> data = struct.unpack('<H',data_packed)[0]
>>> data
40
To print in binary, use string formatting. Either of these works well. bin() doesn't let you specify the number of binary digits to display, and the 0b prefix needs to be removed if not desired.
>>> format(data,'016b')
'0000000000101000'
>>> f'{data:016b}'
'0000000000101000'
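Putting the two steps together, a corrected version of the question's get_bin_str might look like the sketch below. It assumes the input is always exactly two bytes packed with '<H'; the fixed width of 16 digits is a choice made here, not something the question required.

```python
import struct

def get_bin_str(data):
    # Reverse the pack: '<H' reads two bytes as a little-endian unsigned short.
    (value,) = struct.unpack('<H', data)
    # Format as binary, zero-padded to the 16 bits of an unsigned short.
    return format(value, '016b')

data_packed = struct.pack('<H', 40)
print(get_bin_str(data_packed))   # 0000000000101000
```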

You have not said what you are trying to do, so let's assume your goal is to educate yourself. (If you are trying to pack data that will be passed to another program, the only reliable test is to check if the program reads your output correctly.)
Python does not have an "unsigned short" type, so the output of struct.pack() is a bytes object. To see what's in it, just print it:
>>> data_packed = struct.pack('<H', 40)
>>> print(data_packed)
b'(\x00'
What's that? Well, the character (, which is decimal 40 in the ascii table, followed by a null byte. If you had used a number that does not map to a printable ascii character, you'd see something less surprising:
>>> struct.pack("<H", 11)
b'\x0b\x00'
Where 0b is 11 in hex, of course. Wait, I specified "little-endian", so why is my number on the left? The answer is, it's not. Python prints the byte string left to right because that's how English is written, but that's irrelevant. If it helps, think of strings as growing upwards: From low memory locations to high memory. The least significant byte comes first, which makes this little-endian.
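The byte order is easier to see with a value that needs both bytes. A small sketch, using 0x1234 as an arbitrary example value:

```python
import struct

value = 0x1234  # 4660; high byte 0x12, low byte 0x34

little = struct.pack('<H', value)   # least significant byte first
big = struct.pack('>H', value)      # most significant byte first

print(little.hex())   # 3412
print(big.hex())      # 1234

# Both byte strings hold the same number once the byte order is known.
print(int.from_bytes(little, 'little') == int.from_bytes(big, 'big') == value)   # True
```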
Anyway, you can also look at the bytes directly:
>>> print(data_packed[0])
40
Yup, it's still there. But what about the bits, you say? For this, use bin() on each of the bytes separately:
>>> bin(data_packed[0])
'0b101000'
>>> bin(data_packed[1])
'0b0'
The two high bits you see are worth 32 and 8. Your number was less than 256, so it fits entirely in the low byte of the short you constructed.
What's wrong with your unpacking code?
Just for fun let's see what your sequence of transformations in get_bin_str was doing.
>>> binascii.hexlify(data_packed)
b'2800'
Um, all right. Not sure why you converted to hex digits, but now you have 4 bytes, not two. (28 is the number 40 written in hex, the 00 is for the null byte.) In the next step, you call decode and tell it that these 4 bytes are actually UTF-16; there's just enough for two unicode characters, let's take a look:
>>> b'2800'.decode("utf-16-le")
'㠲〰'
In the next step Python finally notices that something is wrong, but by then it does not make much difference because you are pretty far away from the number 40 you started with.
To correctly read your data as a UTF-16 character, call decode directly on the byte string you packed.
>>> data_packed.decode("utf-16-le")
'('
>>> ord('(')
40

base64 string length calculation when encoding unsigned integers only

I am trying to figure out estimates for how many unsigned integer numbers I can encode with 5 characters of base64, 6 characters, and so on.
Through programmatic approach I found out that I can encode
2^28 - 1 = 268,435,455
with 6 characters and
2^35 - 1 = 34,359,738,367
with 7 characters.
(-1 because I start at uint 1)
I am struggling to generalize this though, since I would assume it starts at 2^8 = 256 but I don't get how I end up at 28 and 35.
This is my implementation in Go
func Shorten(num uint64) string {
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, num)
	b := buf[:n]
	encoded := base64.URLEncoding.EncodeToString(b)
	return strings.Replace(encoded, "=", "", -1)
}
Also
0 -> AA
128 -> gAE
16384 -> gIAB
2097152 -> gICAAQ
268435456 -> gICAgAE
So it looks like it's going up in increments of 7 bits: 2^7, 2^14, 2^21, etc. But why 7?

A byte is 8 bits and therefore has 256 possible values. Base64 uses 64 different characters, so each character carries 6 bits. So how many 8-bit bytes fit in one 6-bit character? Zero if you're rounding down, or 3/4 if you aren't. When you start talking about encoding integers, however, your numbers do not appear to make sense. Are you talking about integers written in ASCII? With 6 base64 characters you have 36 bits to play with, so if you're talking about binary 32-bit unsigned integers you can encode one at a time, but it can be any one of the 2**32 possible values, with 4 bits wasted. With ASCII digits you'd have room for 4 characters, so that would be 10000 different possibilities (0 to 9999).
You are getting unexpected results because you're using Go varints, which are not encoded as regular binary integers. Some IPython output for you:
In [22]: base64.b64encode((128).to_bytes(1,'little'))
Out[22]: b'gA=='
Because 128 can be encoded in a single 8-bit byte, it is only 2 characters plus some padding. And look at this:
In [3]: base64.b64decode('gAE=')
Out[3]: b'\x80\x01'
In [4]: int.from_bytes(_,'little')
Out[4]: 384
So as you can see, PutUvarint isn't just encoding an integer in a variable number of bytes; it's encoding it in a way that can be decoded without knowing in advance what size it is. The source code for Go's varint implementation describes this process. Go uses 7 bits of each byte to hold the actual integer data, and the most significant bit is a flag saying whether there is more data yet to come. 128 is just the most significant bit of one byte set, so as a varint it spills into a second byte. So basically you're encoding twice. To encode a given integer as a varint you need roughly 8/7 times the number of bytes the integer occupies, and when you then base64 encode that result you need 8/6 times as many characters as bytes. Depending on what you're doing with the base64, you can likely determine how many bytes you're playing with without resorting to Go varints, and then the calculation would just be the 8/6 conversion (which is 4/3; I left it in bits to match the varint process more closely).
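The two-stage calculation can be sketched in Python. This is an approximation of what the Go code in the question does, not a port of the Go library: put_uvarint, shorten and shortened_length are names chosen here, and the varint layout (7 data bits per byte, low group first, high bit as continuation flag) is the one described above.

```python
import base64

def put_uvarint(num):
    """Encode an unsigned int with 7 data bits per byte, low group first."""
    out = bytearray()
    while num >= 0x80:
        out.append((num & 0x7F) | 0x80)   # high bit set: more bytes follow
        num >>= 7
    out.append(num)                        # last byte has the high bit clear
    return bytes(out)

def shorten(num):
    """URL-safe base64 of the varint bytes, with '=' padding stripped."""
    return base64.urlsafe_b64encode(put_uvarint(num)).rstrip(b"=").decode("ascii")

def shortened_length(num):
    """Predicted length: ceil(bits / 7) varint bytes, then ceil(bytes * 8 / 6) characters."""
    varint_bytes = max(1, (num.bit_length() + 6) // 7)
    return (varint_bytes * 8 + 5) // 6

for n in (0, 128, 16384, 2097152, 268435456):
    print(n, shorten(n), shortened_length(n))
```

Under these assumptions the largest value that fits in 6 characters is 2^28 - 1 (4 varint bytes), and 7 characters cover values up to 2^35 - 1 (5 varint bytes), which matches the thresholds found in the question.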

Implementation of SHA256 in python3, final hash is too short

I'm trying to write an implementation of SHA-256 in python 3. My version is supposed to take in a hexadecimal encoding and output the corresponding hash value. I've used https://en.wikipedia.org/wiki/SHA-2#Pseudocode as guide.
My function works well for most inputs, but sometimes it gives an output that is only 63 hex digits long (instead of 64). My function uses 32-bit binary strings.
I think I have found the problem, in the last step of the algorithm the binary addition
h4 := h4 + e (or another h-vector and corresponding letter)
yields a binary number that is too small. The last thing I do is to use hex() and I should get a string of 8 characters. In this example I only get 7.
out4 = hex(int(h4,2))[2:]
One problematic input is e5e5e5
It gives
"10110101111110101011010101101100" for h4 and "01010001000011100101001001111111" for e
so the addition gives "00000111000010010000011111101011"
and out4 = 70907eb.
What should I do in these cases?

I should get a string of 8 characters
Why do you think so? hex doesn't let you specify the length of the output to begin with, so, for example, if the correct output is 8 bytes of zeros, hex will return 0x0, the shortest representation possible.
I'm guessing the correct output should begin with a zero, but hex is cutting it off. Use format strings to specify the length of the output:
In [1]: f'{0:08x}'
Out[1]: '00000000' # lowercase hexadecimal digits (x), padded with leading zeros (0) to a width of at least 8 characters
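Applied to the values from the question, that looks like the sketch below. The helper name add32_hex is made up here, and the mask is added on the assumption that the sum should wrap modulo 2^32, as the SHA-256 pseudocode specifies.

```python
MASK32 = 0xFFFFFFFF

def add32_hex(a_bits, b_bits):
    """Add two 32-bit binary strings modulo 2**32 and return 8 hex digits."""
    total = (int(a_bits, 2) + int(b_bits, 2)) & MASK32
    return format(total, '08x')   # zero-padded, so leading zeros survive

h4 = "10110101111110101011010101101100"
e = "01010001000011100101001001111111"

out4 = add32_hex(h4, e)
print(out4)        # 070907eb
print(len(out4))   # 8
```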

Decoding Scrambled Binary Data from GPS with Python

I've got a raw binary file (1 KB) that is a serial data dump of a GPS stream (along with some associated metadata). I'm specifically trying to pull a value out of the binary file that represents the GPS time; I know its offset and width in the file (10 and 8 bytes respectively, with a total frame width of 28 bytes) but it's encoded in a very weird way as described in the quote below.
What's the most Pythonic way to read this data (into a list or array)?
GPS TIME - GPS Sensor time (time of week in seconds, starting at
Saturday 2400 hours/ Sunday 0000 hours) if GPS Time Valid Message 3500
is set to 1, otherwise SDN500 system time since power up is reported.
Data words are in the order 2, 1 (MSW), 4 (LSW), 3.
A message word length is 16 bits on the SDN500–HV interface. However,
the SDN500–HV protocol, which uses a standard Universal Asynchronous
Receiver Transmitter (UART), transmits data in 8-bit groups (bytes).
This means that two bytes are required in order to make up one message
word.
A byte of information is transmitted as a sequence of 11 bits: one
start bit, 8 bits of data (least significant bit (LSB) first), one
parity bit (odd), and one stop bit. For each 16-bit data word, the
least significant byte is transmitted first, followed by the most
significant byte. Integer and floating point data types consisting of
more than one word are transmitted from the lowest numbered word to
the highest numbered word. The one exception to this rule is the time
tag, which is output in words 6-9 of each HV output message. The four
16-bit data words are in the following order: 2,1,4,3, where 1
represents the most significant word and 4 the least significant word.
Each word is separately byte-reversed.

Start by opening the file:
import struct
fin = open("20160128t184727_pps", "rb")
Then read in a frame:
def read_frame(f_handle):
    frame = f_handle.read(28)  # 28 byte frame size
    start_byte = 10
    end_byte = 18  # 4 words, each word is 2 bytes
    timestamp_raw = frame[start_byte:end_byte]
    timestamp_words = struct.unpack(">HHHH", timestamp_raw)
    return timestamp_words
I could probably help more, but I don't understand where the timestamp start byte and end byte come from in your description, as they do not seem to match the text you quoted. I also do not know what the expected output value is. If you provide those details I can probably help further.
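As a starting point, the word reordering from the quoted text (order 2, 1, 4, 3) could be undone like this. It is only a sketch under stated assumptions: each 16-bit word is taken to be little-endian ("least significant byte is transmitted first"), decode_time_tag is a made-up name, and reading the result as a 64-bit float is a guess, since the quoted text does not say which data type the time tag uses.

```python
import struct

def decode_time_tag(raw8, word_format="<HHHH"):
    """Reorder 8 bytes sent as words 2, 1, 4, 3 into big-endian bytes 1, 2, 3, 4.

    word_format is an assumption; try ">HHHH" if the words turn out to be
    big-endian in your file.
    """
    w2, w1, w4, w3 = struct.unpack(word_format, raw8)
    # Word 1 is the most significant word, word 4 the least significant.
    return struct.pack(">HHHH", w1, w2, w3, w4)

# Synthetic example: the value 0x0102030405060708 as it would arrive on the wire.
wire = bytes([0x04, 0x03, 0x02, 0x01, 0x08, 0x07, 0x06, 0x05])
ordered = decode_time_tag(wire)
print(ordered.hex())                       # 0102030405060708

# Possible interpretations of the reordered bytes (both are assumptions):
as_int = int.from_bytes(ordered, "big")
as_float = struct.unpack(">d", ordered)[0]
```

Checking the decoded values of consecutive frames against a known time of week is probably the quickest way to find out which byte order and data type are right.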
