I would like to parse and split a hexadecimal bytearray read from the vehicle CAN bus. The entire 64 bits were read as bytearray(b'\xa0\xc6\xa6\xc2\x06\xe3)B'), and I would like to split it based on bit positions. For example, I need the first 6 bits, which are 101000.
Based on some reading from here, I've converted the hex bytearray to a binary string and parsed it successfully. My current approach is:
orig_data = bytearray(b'\xa0\xc6\xa6\xc2\x06\xe3)B')

def hex2bin(hex_string):
    scale = 16                          # hexadecimal base
    num_of_bits = len(hex_string) * 4   # pad to the full width so leading zero bits survive
    return bin(int(hex_string, scale))[2:].zfill(num_of_bits)

bin_str = hex2bin(bytes(orig_data).hex())
print(bin_str[:6])  # '101000'
Since I need to deal with a vast amount of data arriving at high speed, I was wondering: is there a faster way to do this than my current approach?
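A possibly faster approach (just a sketch, not from the original post) is to keep the frame as a single integer and pull fields out with shifts and masks instead of building a 64-character string:

raw = int.from_bytes(orig_data, 'big')   # the 8-byte frame as a 64-bit big-endian integer
first6 = raw >> (64 - 6)                 # top 6 bits
print(format(first6, '06b'))             # '101000'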
-------------------------- additional info --------------------------
Let me add more info here:
The actual situation is that I have this LONG STRING in environment A, and I need to copy and paste it into environment B.
Unfortunately, environment A and environment B are not connected (no mutual access), so I'm looking for a way to encode/decode the string; otherwise, for more files, I have to type the string in by hand, which is slow and not reproducible.
Any suggestions or tool recommendations?
Many thanks!
I'm facing a weird problem: encoding a SUPER LONG binary string into a simpler form, like a few digits.
Say there is a long string consisting of only 1s and 0s, e.g. "110...011", of length 1,000 to 100,000 or even more digits, and I would like to encode this STRING into something with fewer digits/chars. Then I need to reverse it back to the original STRING.
Currently I'm trying the hex/int methods in Python to 'compress' this string and 'decompress' it back to its original form.
An example would be:
Input string: '110011110110011'
def Bi_to_Hex_Int(input_str, method):
    # base 2 -> base 16
    if method == 'hex':
        result = hex(int(input_str, 2))
    # base 2 -> base 10
    if method == 'int':
        result = int(input_str, 2)
    print("input_bi length", len(str(input_str)),
          "\n output length", len(str(result)),
          "\n method: {}".format(method))
    return result

# gene = '110011110110011'
res_16 = Bi_to_Hex_Int(gene, 'hex')   # == '0x67b3'
res_10 = Bi_to_Hex_Int(gene, 'int')   # == 26547
Then I can reverse it back:
def HexInt_to_bi(input_str, method):
    if method == 'hex':
        back_two = bin(int(input_str, 16))[2:]
    if method == 'int':
        back_two = bin(int(input_str))[2:]
    # note: any leading zeros of the original binary string are lost here
    print("input length", len(str(input_str)),
          "\n output bi length", len(str(back_two)))
    return back_two

hexback_two = HexInt_to_bi(res_16, 'hex')
intback_two = HexInt_to_bi(res_10, 'int')
BUT this does have a problem: for a string of around 500 digits ('101010...0001'), the best 'compressed' result is around 127 digits via hex.
So is there a better way to 'compress' the string into even fewer digits?
**Say, 5,000 digits consisting of 1s & 0s, compressed to somewhere around 50-100 digits/chars (or even fewer)?**
If you want it that simple: one hex character compresses 4 binary characters (2^4 = 16). The compression ratio you want is about 50-100 times. For 50 times, you need 50 binary characters to be compressed into 1 character, which means you require 2^50 different characters to encode every combination. That is quite a lot.
If you accept a lower ratio, you may try base64, as described here. Its compression ratio is 6 to 1.
Otherwise you have to come up with some more complex algorithm, like splitting your string into blocks, looking for similar ones among them, encoding them with different symbols, building a map of those symbols, etc.
It is probably easier to compress your string with a general-purpose compressor and then return a base64 representation of the result.
If the task allows, you may store the whole strings somewhere and give them short unique names, so that instead of compressing and decompressing you store and retrieve strings by name.
This probably doesn't produce the absolutely shortest string you can get, but it's trivially easy using the facilities built into Python. There is no need to convert the characters into a binary format; zlib compression will turn an input with only 2 distinct characters into something close to optimal.
Encoding:
import zlib
import base64
result = base64.b64encode(zlib.compress(input_str.encode()))
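For completeness, decoding is just the reverse of those two calls (a sketch, not part of the original answer):

import zlib
import base64

original = zlib.decompress(base64.b64decode(result)).decode()
assert original == input_str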
If the counts of 0s and 1s are significantly different, then you can use enumerative coding to get the shortest representation.
If the string consists only of 0 and 1 digits, then you can pack eight digits into one byte. You will also need to keep track of how many digits there are past the last multiple of eight, since the last byte may represent fewer than eight digits.
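A minimal sketch of that idea (the helper names and the zero-padding scheme are illustrative, not from the answer):

def pack_bits(bitstring):
    # pack a string of '0'/'1' chars into bytes, remembering how many bits
    # of the final byte are real data
    tail = len(bitstring) % 8
    padded = bitstring + '0' * ((8 - tail) % 8)
    packed = bytes(int(padded[i:i+8], 2) for i in range(0, len(padded), 8))
    return packed, tail or 8

def unpack_bits(packed, last_bits):
    bits = ''.join(format(b, '08b') for b in packed)
    # drop the padding that was added to fill out the last byte
    return bits if last_bits == 8 else bits[:len(bits) - (8 - last_bits)]

packed, last = pack_bits('110011110110011')
assert unpack_bits(packed, last) == '110011110110011'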
I am trying to figure out estimates for how many unsigned integer numbers I can encode with 5 characters of base64, 6 characters, and so on.
Through programmatic approach I found out that I can encode
2^28 - 1 = 268,435,455
with 6 characters and
2^35 - 1 = 34,359,738,367
with 7 characters.
(-1 because I start at uint 1)
I am struggling to generalize this though, since I would assume it starts at 2^8 = 256 but I don't get how I end up at 28 and 35.
This is my implementation in Go
import (
	"encoding/base64"
	"encoding/binary"
	"strings"
)

// Shorten varint-encodes num, base64url-encodes the bytes, and strips the padding.
// e.g. Shorten(128) == "gAE"
func Shorten(num uint64) string {
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, num)
	b := buf[:n]
	encoded := base64.URLEncoding.EncodeToString(b)
	return strings.Replace(encoded, "=", "", -1)
}
Also
0 -> AA
128 -> gAE
16384 -> gIAB
2097152 -> gICAAQ
268435456 -> gICAgAE
So it looks like it's going up in increments of 7: 2^7, 2^14, 2^21, etc. But why 7?
A byte is 8 bits and therefore has 256 possible values. Base64 uses 64 different characters, so each character carries 6 bits. So how many 8-bit objects can you fit in 6 bits? Zero if you're rounding, or 3/4 if you aren't. When you start talking about encoding integers, however, your numbers do not appear to make sense. Are you talking about integers written in ASCII? With 6 base64 characters you have 36 bits to play with, so if you're talking about binary 32-bit unsigned integers you can encode one at a time, but you can encode any of them you want, for 2**32 different possibilities, with 4 bits wasted. With ASCII you'd have 4 characters, so it would be 10,000 different possibilities (0 to 9999).
You are getting unexpected results because you're using Go varints, which are not encoded as regular binary integers. Some IPython output for you:
In [22]: base64.b64encode((128).to_bytes(1,'little'))
Out[22]: b'gA=='
Because 128 can be encoded in a single 8-bit byte, it is only 2 characters plus some padding. And look at this:
In [3]: base64.b64decode('gAE=')
Out[3]: b'\x80\x01'
In [4]: int.from_bytes(_,'little')
Out[4]: 384
So as you can see, PutUvarint isn't just encoding an integer of variable length, it's encoding a variable integer, i.e. one encoded in a way that can be decoded without knowing its size in advance. If you look at the source code for the Go varint module, it describes this process: Go uses 7 bits of each byte to hold actual integer data, and the most significant bit is a flag indicating whether there is more data yet to come (128 is just the most significant bit of one byte set). So basically you're encoding twice with this approach: for a given integer, the varint encoding needs the number of bytes the integer occupies times 8/7 to store the value, and then base64 needs that result times 8/6 to store it. Depending on what you're doing with the base64, you can likely determine how many bytes you're playing with without resorting to Go varints, and then the calculation is just the 8/6 conversion (which is 4/3; I left it in bits to match the varint process more closely).
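To connect this back to the 28 and 35 in the question: 6 base64 characters with the padding stripped correspond to 4 raw bytes, and 4 varint bytes carry 4 * 7 = 28 payload bits; 7 characters correspond to 5 bytes, i.e. 35 bits. A short Python sketch (not from either post, mimicking the uvarint format described above) reproduces those boundaries:

import base64

def put_uvarint(n):
    # unsigned varint: 7 payload bits per byte, high bit set while more bytes follow
    out = bytearray()
    while True:
        b = n & 0x7f
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

for num in (2**28 - 1, 2**28, 2**35 - 1, 2**35):
    raw = put_uvarint(num)
    enc = base64.urlsafe_b64encode(raw).rstrip(b'=')
    print(num, '->', len(raw), 'varint bytes ->', len(enc), 'base64 chars')
# 2**28 - 1 fits in 6 characters, 2**28 already needs 7, and 2**35 - 1 fits in 7.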
I'm implementing the Huffman algorithm, but when I generate the final compressed code, I get a string similar to the one below:
10001111010010101010101
This is a binary code created from the paths to my tree's leaves.
I have this sequence, but when I save it to a file, the system just saves it as ASCII text, which gives me no compression because it is the same size as or bigger than the original.
How do I save this binary data directly?
PS: if I use some function to convert my string to binary, all I get is my ASCII converted to binary, so that achieves nothing; I need a real solution.
What you need to do is take each 8 bits and convert them into a byte to write out, looping until you have fewer than 8 bits remaining. Then save whatever is left over to prepend in front of the next value.
def binarize(bitstring):
    wholebytes = len(bitstring) // 8
    # convert each complete group of 8 bits into one character (byte value 0-255)
    chars = [chr(int(bitstring[i*8:i*8+8], 2)) for i in range(wholebytes)]
    # whatever is left over (< 8 bits) gets prepended to the next chunk
    remainder = bitstring[wholebytes*8:]
    return ''.join(chars), remainder
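A possible way to use it (just a sketch; the file name is illustrative):

packed, leftover = binarize("10001111010010101010101")
with open("huffman.bin", "wb") as f:
    f.write(packed.encode("latin-1"))   # chr() values 0-255 map one-to-one to bytes in latin-1
# 'leftover' holds the trailing bits and must be carried over to the next chunk
# (or padded and written out, with the pad length recorded somewhere).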
I think you just want int() with a base value of 2:
my_string = "10001111010010101010101"
code_num = int(my_string, 2)
Then write to a binary file. struct.pack additionally allows you to specify whatever byte order you like.
import struct

myfile = open("filename.txt", 'wb')
mybytes = struct.pack('i', code_num)
myfile.write(mybytes)
myfile.close()
This method will also write some number of leading zeros, which could cause trouble for your Huffman codes.
I have a file with big-endian binary data. There are two numeric fields; the first is 8 bytes long and the second is 12 bytes. How can I unpack the two numbers?
I am using the Python module struct (https://docs.python.org/2/library/struct.html) and it works for the first field
num1 = struct.unpack('>Q',payload[0:8])
but I don't know how I can unpack the second number. If I treat it as char(12), then I get something like '\x00\xe3AC\x00\x00\x00\x06\x00\x00\x00\x01'.
Thanks.
I think you should create a new byte string of length 16 for the second number: fill the last 12 bytes with the bytes that hold your number and the first 4 bytes with zeros.
Then decode the byte string with unpack using the format >QQ, say into numHI and numLO variables. Then you get the final number as number = numHI * 2**64 + numLO. As far as I remember, integers in Python can be (almost) as large as you wish, so you will have no problems with overflow. That's only a rough idea; please comment if you have problems writing it in actual Python code, and I'll edit my answer to provide more help.
(Use ** for the integer power here rather than math.pow, which returns a float. Alternatively, you can use a bit shift: number = (numHI << 64) + numLO.)
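A sketch of that idea in code, assuming the 12-byte field starts right after the first 8 bytes of payload (that offset is my assumption):

import struct

padded = b'\x00' * 4 + payload[8:20]        # pad the 12-byte field to 16 bytes
numHI, numLO = struct.unpack('>QQ', padded)
num2 = (numHI << 64) + numLO

# In Python 3 the same result comes straight from:
# num2 = int.from_bytes(payload[8:20], 'big')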
I'd like to get the exact sequence of bits from a file into a string using Python 3. There are several questions on this topic which come close, but don't quite answer it. So far, I have this:
>>> data = open('file.bin', 'rb').read()
>>> data
'\xa1\xa7\xda4\x86G\xa0!e\xab7M\xce\xd4\xf9\x0e\x99\xce\xe94Y3\x1d\xb7\xa3d\xf9\x92\xd9\xa8\xca\x05\x0f$\xb3\xcd*\xbfT\xbb\x8d\x801\xfanX\x1e\xb4^\xa7l\xe3=\xaf\x89\x86\xaf\x0e8\xeeL\xcd|*5\xf16\xe4\xf6a\xf5\xc4\xf5\xb0\xfc;\xf3\xb5\xb3/\x9a5\xee+\xc5^\xf5\xfe\xaf]\xf7.X\x81\xf3\x14\xe9\x9fK\xf6d\xefK\x8e\xff\x00\x9a>\xe7\xea\xc8\x1b\xc1\x8c\xff\x00D>\xb8\xff\x00\x9c9...'
>>> bin(data[:][0])
'0b11111111'
OK, I can get a base-2 number, but I don't understand why I need data[:][x], and I still have the leading 0b. It would also seem that I have to loop through the whole string and do some casting and parsing to get the correct output. Is there a simpler way to just get the sequence of 0s and 1s without looping, parsing, and concatenating strings?
Thanks in advance!
I would first precompute the string representation for all values 0..255
bytetable = [("00000000"+bin(x)[2:])[-8:] for x in range(256)]
or, if you prefer bits in LSB to MSB order
bytetable = [("00000000"+bin(x)[2:])[-1:-9:-1] for x in range(256)]
then the whole file in binary can be obtained with
binrep = "".join(bytetable[x] for x in open("file", "rb").read())
If you are OK using an external module, this uses bitstring:
>>> import bitstring
>>> bitstring.BitArray(filename='file.bin').bin
'110000101010000111000010101001111100...'
and that's it. It just makes the binary string representation of the whole file.
It is not quite clear what the sequence of bits is meant to be. I think it would be most natural to start at byte 0 with bit 0, but it actually depends on what you want.
So here is some code to access the sequence of bits starting with bit 0 in byte 0:
def bits_from_char(c):
    # in Python 3, iterating over bytes already yields ints, so ord() is only
    # needed if the data is a str (Python 2 style)
    i = c if isinstance(c, int) else ord(c)
    for dummy in range(8):
        yield i & 1
        i >>= 1

def bits_from_data(data):
    for c in data:
        for bit in bits_from_char(c):
            yield bit

for bit in bits_from_data(data):
    # process bit
    ...
(Another note: you do not need data[:][0] in your code. Simply data[0] does the trick, without copying the whole string first.)
To convert raw binary data such as b'\xa1\xa7\xda4\x86' into a bitstring that represents the data as a number in binary (base 2) in Python 3:
>>> data = open('file.bin', 'rb').read()
>>> bin(int.from_bytes(data, 'big'))[2:]
'1010000110100111110110100011010010000110...'
See Convert binary to ASCII and vice versa.
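One caveat worth noting (not in the original answer): bin() drops any leading zero bits of the first byte. If those matter, pad to the full width:

bits = bin(int.from_bytes(data, 'big'))[2:].zfill(8 * len(data))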