Sorry for the simple question, but it's blowing my mind, since I'm not good at data structures.
First, I have an initial binary file with compressed raw data. A colleague helped me turn the bytes into an array of decimals in Python (the code is given below and works just fine, showing the result as a chart in pyplot).
Now I want to do the reverse operation, i.e. turn an array of decimal numbers back into a binary file, but I'm totally stuck. Thank you very much in advance!
data_out = []

# decode 1st point
data_out.append(int.from_bytes(data_in[0:4], byteorder='big', signed=True))

i = 4
while i < len(data_in):
    # get next byte
    curr = int.from_bytes(data_in[i:i+1], byteorder='big', signed=False)
    if curr < 255:
        res = curr - 127
        data_out.append(res + data_out[-1])
        i = i + 1
    else:
        res = int.from_bytes(data_in[i+1:i+5], byteorder='little', signed=True)
        data_out.append(res)
        i = i + 5
from matplotlib import pyplot as plt
plt.plot(data_out)
plt.show()
The original stream of bytes was encoded as one- or four-byte integers. The first value is sent as a four-byte integer. After the first value, you have either one byte in the range 0..254, which represents a difference of -127 to 127, or a 255 byte followed by a four-byte signed little-endian integer, which is the next value (not a difference). The idea is that if the integers change slowly from one to the next, this compresses the sequence by up to a factor of four by sending small differences as one byte instead of four. If too many differences don't fit in a byte, though, this can expand the data by 25%, since non-difference values take five bytes instead of four.
To encode such a stream, you start by encoding the first value directly as four bytes, big endian (which is what the decoder above expects for the first value; the later full values are little endian). For each subsequent value, you subtract the previous value from the current one. If the result is in the range -127 to 127, then add 127 and send that byte. Otherwise send a 255 byte, followed by the value (not the difference) as a four-byte signed little-endian integer.
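For the encoding side, here is a sketch of one way to write it (it assumes values is a list of Python ints and mirrors the byte orders the decoder above expects):

def encode(values):
    # First value: four bytes, big endian, matching the decoder above.
    encoded = bytearray(values[0].to_bytes(4, byteorder='big', signed=True))
    for prev, curr in zip(values, values[1:]):
        diff = curr - prev
        if -127 <= diff <= 127:
            # Small difference: one byte in the range 0..254.
            encoded.append(diff + 127)
        else:
            # Escape byte 255, then the value itself (not the difference),
            # as a four-byte signed little-endian integer.
            encoded.append(255)
            encoded.extend(curr.to_bytes(4, byteorder='little', signed=True))
    return bytes(encoded)

Writing encode(data_out) to a file opened in 'wb' mode then gives you back a binary file.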
As pointed out by @greybeard, there is an error in your colleague's code (assuming it was copied here correctly), in that res is not initialized. The first point decoding needs to be:
# decode 1st point
res = int.from_bytes(data_in[0:4], byteorder='big', signed=True)
data_out.append(res)
Related
I'm using the following code to pack an integer into an unsigned short:

import struct

raw_data = 40
# Pack into little endian
data_packed = struct.pack('<H', raw_data)
Now I'm trying to unpack the result as follows. I use utf-16-le since the data is encoded as little-endian.
import binascii

def get_bin_str(data):
    bin_asc = binascii.hexlify(data)
    result = bin(int(bin_asc.decode("utf-16-le"), 16))
    trimmed_res = result[2:]
    return trimmed_res
print(get_bin_str(data_packed))
Unfortunately, it throws the following error,
    result = bin(int(bin_asc.decode("utf-16-le"), 16))
ValueError: invalid literal for int() with base 16: '㠲〰'
How do I properly decode the little-endian bytes into binary data?
Use unpack to reverse what you packed. The data isn't UTF-encoded so there is no reason to use UTF encodings.
>>> import struct
>>> data_packed = struct.pack('<H', 40)
>>> data_packed.hex() # the two little-endian bytes are 0x28 (40) and 0x00 (0)
'2800'
>>> data = struct.unpack('<H', data_packed)
>>> data
(40,)
unpack returns a tuple, so index it to get the single value:
>>> data = struct.unpack('<H', data_packed)[0]
>>> data
40
To print in binary, use string formatting. Either of these works; bin() doesn't let you specify the number of binary digits to display, and its 0b prefix has to be removed if it isn't wanted.
>>> format(data,'016b')
'0000000000101000'
>>> f'{data:016b}'
'0000000000101000'
You have not said what you are trying to do, so let's assume your goal is to educate yourself. (If you are trying to pack data that will be passed to another program, the only reliable test is to check if the program reads your output correctly.)
Python does not have an "unsigned short" type, so the output of struct.pack() is a byte string. To see what's in it, just print it:
>>> data_packed = struct.pack('<H', 40)
>>> print(data_packed)
b'(\x00'
What's that? Well, the character (, which is decimal 40 in the ASCII table, followed by a null byte. If you had used a number that does not map to a printable ASCII character, you'd see something less surprising:
>>> struct.pack("<H", 11)
b'\x0b\x00'
Where 0b is 11 in hex, of course. Wait, I specified "little-endian", so why is my number on the left? The answer is, it's not. Python prints the byte string left to right because that's how English is written, but that's irrelevant. If it helps, think of strings as growing upwards: From low memory locations to high memory. The least significant byte comes first, which makes this little-endian.
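If you want more convincing, int.from_bytes will reinterpret those same two bytes with either byte order:

>>> int.from_bytes(data_packed, byteorder='little')
40
>>> int.from_bytes(data_packed, byteorder='big')
10240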
Anyway, you can also look at the bytes directly:
>>> print(data_packed[0])
40
Yup, it's still there. But what about the bits, you say? For this, use bin() on each of the bytes separately:
>>> bin(data_packed[0])
'0b101000'
>>> bin(data_packed[1])
'0b0'
The two set bits you see are worth 32 and 8. Your number was less than 256, so it fits entirely in the low byte of the short you constructed.
What's wrong with your unpacking code?
Just for fun let's see what your sequence of transformations in get_bin_str was doing.
>>> binascii.hexlify(data_packed)
b'2800'
Um, all right. Not sure why you converted to hex digits, but now you have four bytes, not two. (28 is the number 40 written in hex; the 00 is the null byte.) In the next step, you call decode and tell it that these four bytes are actually UTF-16; there's just enough for two Unicode characters. Let's take a look:
>>> b'2800'.decode("utf-16-le")
'㠲〰'
In the next step Python finally notices that something is wrong, but by then it does not make much difference because you are pretty far away from the number 40 you started with.
To correctly read your data as a UTF-16 character, call decode directly on the byte string you packed:
>>> data_packed.decode("utf-16-le")
'('
>>> ord('(')
40
I'm running a server that receives UDP packets that contain a 2 byte CRC32 polynomial and a variable number of XOR'd DWORDs corresponding to a .jpg file. The packets also contain the index of the corresponding DWORD in the .jpg file for each DWORD in the packet. I am also given the actual .jpg file.
For example, the packet could contain 10 DWORDs and specify the starting index as 3, so we can expect the received DWORDs to correspond with the 4th through 13th DWORDs making up the .jpg.
I want to verify the integrity of each of the DWORDs by comparing their CRC32 values against the CRC32 values of the corresponding DWORDs in the .jpg.
I thought that the proper way to do this would be to divide each DWORD in the packet and its corresponding DWORD in the .jpg by the provided CRC polynomial and analyze the remainder. If the remainders are the same after doing these divisions, then there is no problem with the packet. However, even with packets that are guaranteed to be correct, these remainders are never equal.
Here is how I'm reading the bytes of the actual .jpg and splitting them up into DWORDs:
def split(data):
    # Split the .jpg data into DWORDs
    chunks = []
    for i in range(0, len(data), 4):
        chunks.append(data[i: i + 4])
    return chunks

def get_image_bytes():
    with open("dog.jpg", "rb") as image:
        f = image.read()
    jpg_bytes = split(f)
    return jpg_bytes
I have verified that my split() function works, and to my knowledge get_image_bytes() reads the .jpg correctly by calling image.read().
After receiving a packet, I convert each DWORD to binary and perform the mod 2 division like so:
jpg_bytes = get_image_bytes()
crc_key_bin = '1000110111100'  # binary representation of the received CRC32 polynomial
d_words = [b'\xc3\xd4)v', ... , b'a4\x96\xbb']
iteration = 0  # For simplicity, assume the packet specified that the starting index is 0

for d in d_words:
    d_bin = format(int(d.hex(), 16), "b")  # binary representation of the DWORD from the packet
    jpg_dword = format(int(jpg_bytes[iteration].hex(), 16), "b")  # binary representation of the corresponding DWORD in dog.jpg
    remainder1 = mod2div(d_bin, crc_key_bin)  # <--- These remainders should be
    remainder2 = mod2div(jpg_dword, crc_key_bin)  # <--- equal, but they're not!
    iteration += 1
I have tested the mod2div() function, and it returns the expected remainder after performing mod 2 division.
Where am I going wrong? I'm expecting the 2 remainders to be equal, but they never are. I'm not sure if the way I'm reading the bytes from the .jpg file is incorrect, if I'm performing the mod 2 division with the wrong values, or if I'm completely misunderstanding how to verify the CRC32 values. I'd appreciate any help.
First off, there's no such thing as a "2 byte CRC32 polynomial". A 32-bit CRC needs 32 bits to specify the polynomial.
Second, a CRC polynomial is something that is fixed for a given protocol. Why is a CRC polynomial being transmitted, as opposed to simply specified? Are you sure it's the polynomial? Where is this all documented?
What does "XOR'd DWORDs" means? Exclusive-or'd with what?
And, yes, I think you are completely misunderstanding how to verify CRC values. All you need to do is calculate the check values on the message the same way it was done at the other end, and compare that to the check values that were transmitted. (That is true for any check value, not just CRCs.) However I cannot tell from your description what was calculated on what, or how.
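For illustration, a sketch of that comparison in Python using zlib.crc32 (an assumption: this is the standard CRC-32, which may not be the polynomial your protocol actually uses, and it assumes one check value per DWORD):

import zlib

def crc32_per_dword(dwords):
    # Standard CRC-32 of each 4-byte chunk, as an unsigned 32-bit value.
    return [zlib.crc32(d) & 0xFFFFFFFF for d in dwords]

# Recompute over the received DWORDs and over the reference .jpg DWORDs
# (starting index 0, as in your snippet), then compare.
ok = crc32_per_dword(d_words) == crc32_per_dword(jpg_bytes[:len(d_words)])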
Trying to convert a binary list into signed 16-bit little-endian integers.
input_data = [['1100110111111011','1101111011111111','0010101000000011'],['1100111111111011','1101100111111111','0010110100000011']]
Desired Output = [[-1074, -34, 810], [-1703, -39, 813]]
This is what I've got so far. It's been adapted from "Hex string to signed int in Python 3.2?" and "Conversion from HEX to SIGNED DEC in python":
results = []
for i in input_data:
    hex_convert = [hex(int(x,2)) for x in i]
    convert = [int(y[4:6] + y[2:4], 16) for y in hex_convert]
    results.append(convert)
print(results)
output: [[64461, 65502, 810], [64463, 65497, 813]]
This works fine, but the above are unsigned integers. I need signed integers capable of handling negative values. I then tried a different approach:
results_2 = []
for i in input_data:
    hex_convert = [hex(int(x,2)) for x in i]
    to_bytes = [bytes(j, 'utf-8') for j in hex_convert]
    split_bits = [int(k, 16) for k in to_bytes]
    convert_2 = [int.from_bytes(b, byteorder = 'little', signed = True) for b in to_bytes]
    results_2.append(convert_2)
print(results_2)
Output: [[108191910426672, 112589973780528, 56282882144304], [108191943981104, 112589235583024, 56282932475952]]
This result is even wilder than the first. I know my approach is wrong, and it doesn't help that I've never been able to get my head around binary conversion etc., but I feel I'm on the right path with:
(b, byteorder = 'little', signed = True)
but can't work out where I'm wrong. Any help explaining this concept would be greatly appreciated.
This result is even wilder than the first. I know my approach is wrong... but can't work out where I'm wrong.
The problem is in the conversion to bytes. Let's look at it a step at a time:
int(x, 2)
Fine; we treat the string as a base-2 representation of the integer value, and get that integer. Only problem is it's a) unsigned and b) big-endian.
hex(int(x,2))
What this does is create a string representation of the integer, in base 16, with a 0x prefix. Notably, there are two text characters per byte that we want. This is already heading down the wrong path.
You might have thought of using hexadecimal because you've seen \xAB style escapes inside string representations. This is a completely different thing. The string '\xAB' contains one character. The string '0xAB' contains four.
From there, everything else is still nonsense. Converting to bytes with a text encoding just means that the text character 0 for example is replaced with the byte value 48 (since in UTF-8 it's encoded with a single byte with that value). For this data we get the same results with UTF-8 that we would by assuming plain ASCII (since UTF-8 is "ASCII transparent" and there are no non-ASCII characters in the text).
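A quick experiment shows this; encoding the text gives you the byte values of the characters, not the value they spell:

>>> '0xab'.encode('utf-8')
b'0xab'
>>> list('0xab'.encode('utf-8'))
[48, 120, 97, 98]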
So how do we do it?
We want to convert the integer from the first step into the bytes used to represent it. Just as there is a .from_bytes class method allowing us to create an integer from underlying bytes, there is an instance method allowing us to get the bytes that would represent the integer.
So, we use .to_bytes, specifying the length, signedness and endianness that was assumed when we created the int from the binary string - that gives us bytes that correspond to that string. Then, we re-create the integer from those bytes, except now specifying the proper signedness and endianness. The reason that .to_bytes makes us specify a length is because the integer doesn't have a particular length - there are a minimum number of bytes required to represent it, but you could use as many more as you like. (This is especially important if you want to handle signed values, since it will do sign-extension automatically.)
Thus:
results_2 = []
for i in input_data:
    values = [int(x, 2) for x in i]
    as_bytes = [x.to_bytes(2, byteorder='big', signed=False) for x in values]
    reinterpreted = [int.from_bytes(x, byteorder='little', signed=True) for x in as_bytes]
    results_2.append(reinterpreted)
But let's improve the organization of the code a bit. I will first make a function to handle a single integer value, and then we can use comprehensions to process the list. In fact, we can use nested comprehensions for the nested list.
def as_signed_little(binary_str):
    # This time, taking advantage of positional args and default values.
    as_bytes = int(binary_str, 2).to_bytes(2, 'big')
    return int.from_bytes(as_bytes, 'little', signed=True)

# And now we can do:
results_2 = [[as_signed_little(x) for x in i] for i in input_data]
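For example, checking it against the third sample value, which should come out as 810:

>>> as_signed_little('0010101000000011')
810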
I am conducting a Padding Oracle Attack for my Information Security course. Without getting into the details of the attack, I need to have a for loop that loops through all possible 1 byte hex values.
Pseudo-code of what I need to do:
for x in range('\x00', '\xFF'):
    replace last byte of ciphertext with byte
    perform padding check
I cannot figure out how to accomplish this. Any ideas?
Bytes are really just integers in the range 0-255 (inclusive), or in hex, 0x00 through 0xFF. Generate the integer, then create the byte value from that. The bytes() type takes a list of integers, so create a list of length 1:
for i in range(256):
    b = bytes([i])
If you store the ciphertext in a bytearray() object, you could even trivially replace that last byte by using the integer directly:
ciphertext_mutable = bytearray(ciphertext)
for i in range(256):
    ciphertext_mutable[-1] = i  # replace last byte
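From there, the pseudo-code above becomes a straightforward loop. Here padding_is_valid is a hypothetical stand-in for whatever check your padding oracle performs:

ciphertext_mutable = bytearray(ciphertext)
for i in range(256):
    ciphertext_mutable[-1] = i  # try every possible last byte
    if padding_is_valid(bytes(ciphertext_mutable)):  # hypothetical oracle call
        print('padding accepted for byte value', i)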
Given this example in Python
sample = '5PB37L2CH5DUDWN2SUOYE6LJPYCJBFM5N2FGVEHF7HD224UR52KB===='
a = base64.b32decode(sample)
b = base64.b32encode(a)
why is it that
sample != b ?
BUT where
sample = '5PB37L2CH5DUDWN2SUOYE6LJPYCJBFM5N2FGVEHF7HD224UR52KBAAAA'
then
sample == b
The first sample you got there is invalid base64.
Taken from Wikipedia:
When the number of bytes to encode is not divisible by 3 (that is, if there are only one or two bytes of input for the last block), then the following action is performed: Add extra bytes with value zero so there are three bytes, and perform the conversion to base64. If there was only one significant input byte, only the first two base64 digits are picked, and if there were two significant input bytes, the first three base64 digits are picked. '=' characters might be added to make the last block contain four base64 characters.
http://en.wikipedia.org/wiki/Base64#Examples
edit:
taken from RFC 4648:
Special processing is performed if fewer than 24 bits are available
at the end of the data being encoded. A full encoding quantum is
always completed at the end of a quantity. When fewer than 24 input
bits are available in an input group, bits with value zero are added
(on the right) to form an integral number of 6-bit groups. Padding
at the end of the data is performed using the '=' character.
4 times 8 bits (the ='s at the end of your sample) is more than 24 bits, so at the least they are unnecessary. (Not sure what datatype sample is, but find out and take its size times the number of characters, divided by 24.)
About your particular sample:
Base encoding reads in 24-bit chunks, and only needs '=' padding characters at the end of the encoded string to make whatever is left after splitting into 24-bit chunks be "of size 24", so it can be parsed by the decoder.
Since the ===='s at the end of your string amount to more than 24 bits, they are useless, hence: invalid...
First, let's be clear: your question is about base32, not base64.
Your original sample is a bit too long. There are four = padding characters at the end, meaning at least 20 bits of padding. The number of bits must be a multiple of 8, so it's really 24 bits. The encoding for B in base32 is 1, which means one of the padding bits is set. This is a violation of the spec, which says all the padding bits must be clear. The decode drops the bit completely, and the encode produces the proper value A instead of B.
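You can see the repair in a quick round trip (Python 3 here, where b32encode returns bytes):

>>> import base64
>>> sample = '5PB37L2CH5DUDWN2SUOYE6LJPYCJBFM5N2FGVEHF7HD224UR52KB===='
>>> b = base64.b32encode(base64.b32decode(sample))
>>> sample[-8:], b[-8:]
('52KB====', b'52KA====')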