I have a long bytearray
barray=b'\x00\xfe\x4b\x00...
What would be the best way to convert it to a list of 2-byte integers?
You can use the struct package for that:
from struct import unpack
tuple_of_shorts = unpack('h'*(len(barray)//2),barray)
This will produce signed shorts. For unsigned ones, use 'H' instead:
tuple_of_shorts = unpack('H'*(len(barray)//2),barray)
This produces on a little-endian machine for your sample input:
>>> struct.unpack('h'*(len(barray)//2),barray)
(-512, 75)
>>> struct.unpack('H'*(len(barray)//2),barray)
(65024, 75)
In case you want to work with big endian, or little endian, you can put a > (big endian) or < (little endian) in the specifications. For instance:
# Big endian
tuple_of_shorts = unpack('>'+'H'*(len(barray)//2),barray) # unsigned
tuple_of_shorts = unpack('>'+'h'*(len(barray)//2),barray) # signed
# Little endian
tuple_of_shorts = unpack('<'+'H'*(len(barray)//2),barray) # unsigned
tuple_of_shorts = unpack('<'+'h'*(len(barray)//2),barray) # signed
Generating:
>>> unpack('>'+'H'*(len(barray)//2),barray) # big endian, unsigned
(254, 19200)
>>> unpack('>'+'h'*(len(barray)//2),barray) # big endian, signed
(254, 19200)
>>> unpack('<'+'H'*(len(barray)//2),barray) # little endian, unsigned
(65024, 75)
>>> unpack('<'+'h'*(len(barray)//2),barray) # little endian, signed
(-512, 75)
Using the struct module:
import struct
count = len(barray)/2
integers = struct.unpack('H'*count, barray)
Depending on the endianness you may want to prepend a < or > for the unpacking format. And depending on signed/unsigned, it's h, or H.
Note, using the Python struct library to convert your array also allows you to specify a repeat count for each item in the format specifier. So 4H for example would be the same as using HHHH.
Using this approach avoids the need to create potentially massive format strings:
import struct
barray = b'\x00\xfe\x4b\x00\x4b\x00'
integers = struct.unpack('{}H'.format(len(barray)/2), barray)
print(integers)
Giving you:
(65024, 75, 75)
If memory efficiency is a concern, you may consider using an array.array:
>>> barr = b'\x00\xfe\x4b\x00'
>>> import array
>>> short_array = array.array('h', barr)
>>> short_array
array('h', [-512, 75])
This is like a space-efficient primitive array, with an OO-wrapper, so it supports sequence-type methods you would have on a list, like .append, .pop, and slicing!
>>> short_array[:1]
array('h', [-512])
>>> short_array[::-1]
array('h', [75, -512])
Also, recovering your bytes object becomes trivial:
>>> short_array
array('h', [-512, 75])
>>> short_array.tobytes()
b'\x00\xfeK\x00'
Note, if you want the opposite endianness from the native byte-order, use the in-place byteswap method:
>>> short_array.byteswap()
>>> short_array
array('h', [254, 19200])
Related
This question already has answers here:
Efficient way to swap bytes in python
(5 answers)
Closed 4 months ago.
I've created a buffer of words represented in little endian(Assuming each word is 2 bytes):
A000B000FF0A
I've separated the buffer to 3 words(2 bytes each)
A000
B000
FF0A
and after that converted to big endian representation:
00A0
00B0
0AFF
Is there a way instead of split into words to represent the buffer in big endian at once?
Code:
buffer='A000B000FF0A'
for i in range(0, len(buffer), 4):
value = endian(int(buffer[i:i + 4], 16))
def endian(num):
p = '{{:0{}X}}'.format(4)
hex = p.format(num)
bin = bytearray.fromhex(hex).reverse()
l = ''.join(format(x, '02x') for x in bin)
return int(l, 16)
Using the struct or array libraries are probably the easiest ways to do this.
Converting the hex string to bytes first is needed.
Here is an example of how it could be done:
from array import array
import struct
hex_str = 'A000B000FF0A'
raw_data = bytes.fromhex(hex_str)
print("orig string: ", hex_str.casefold())
# With array lib
arr = array('h')
arr.frombytes(raw_data)
# arr = array('h', [160, 176, 2815])
arr.byteswap()
array_str = arr.tobytes().hex()
print(f"Swap using array: ", array_str)
# With struct lib
arr2 = [x[0] for x in struct.iter_unpack('<h', raw_data)]
# arr2 = [160, 176, 2815]
struct_str = struct.pack(f'>{len(arr2) * "h"}', *arr2).hex()
print("Swap using struct:", struct_str)
Gives transcript:
orig string: a000b000ff0a
Swap using array: 00a000b00aff
Swap using struct: 00a000b00aff
You can use the struct to interpret your bytes as big or little endian. Then you can use the hex() method of the bytearray object to have a nice string representation.
Docs for struct.
import struct
# little endian
a = struct.pack("<HHH",0xA000,0xB000,0xFF0A)
# bih endian
b = struct.pack(">HHH",0xA000,0xB000,0xFF0A)
print(a)
print(b)
# convert back to string
print( a.hex() )
print( b.hex() )
Which gives:
b'\x00\xa0\x00\xb0\n\xff'
b'\xa0\x00\xb0\x00\xff\n'
00a000b00aff
a000b000ff0a
I have a binary file made from C structs that I want to parse in Python. I know the exact format and layout of the binary but I am confused on how to use Python Struct unpacking to read this data.
Would I have to traverse the whole binary unpacking a certain number of bytes at a time based on what the members of the struct are?
C File Format:
typedef struct {
int data1;
int data2;
int data4;
} datanums;
typedef struct {
datanums numbers;
char *name;
} personal_data;
Lets say the binary file had personal_data structs repeatedly after another.
Assuming the layout is a static binary structure that can be described by a simple struct pattern, and the file is just that structure repeated over and over again, then yes, "traverse the whole binary unpacking a certain number of bytes at a time" is exactly what you'd do.
For example:
record = struct.Struct('>HB10cL')
with open('myfile.bin', 'rb') as f:
while True:
buf = f.read(record.size)
if not buf:
break
yield record.unpack(buf)
If you're worried about the efficiency of only reading 17 bytes at a time and you want to wrap that up by buffering 8K at a time or something… well, first make sure it's an actual problem worth optimizing; then, if it is, loop over unpack_from instead of unpack. Something like this (untested, top-of-my-head code):
buf, offset = b'', 0
with open('myfile.bin', 'rb') as f:
if len(buf) < record.size:
buf, offset = buf[offset:] + f.read(8192), 0
if not buf:
break
yield record.unpack_from(buf, offset)
offset += record.size
Or, even simpler, as long as the file isn't too big for your vmsize, just mmap the whole thing and unpack_from on the mmap itself:
with open('myfile.bin', 'rb') as f:
with mmap.mmap(f, 0, access=mmap.ACCESS_READ) as m:
for offset in range(0, m.size(), record.size):
yield record.unpack_from(m, offset)
You can unpack a few at a time. Let's start with this example:
In [44]: a = struct.pack("iiii", 1, 2, 3, 4)
In [45]: a
Out[45]: '\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00'
If you're using a string, you can just use a subset of it, or use unpack_from:
In [49]: struct.unpack("ii",a[0:8])
Out[49]: (1, 2)
In [55]: struct.unpack_from("ii",a,0)
Out[55]: (1, 2)
In [56]: struct.unpack_from("ii",a,4)
Out[56]: (2, 3)
If you're using a buffer, you'll need to use unpack_from.
I have a large string more than 256 bits and and I need to byte swap it by 32 bits. But the string is in a hexadecimal base. When I looked at numpy and array modules I couldnt find the right syntax as to how to do the coversion. Could someone please help me?
An example:(thought the data is much longer.I can use pack but then I would have to convert the little endian to decimal and then to big endian first which seems like a waste):
Input:12345678abcdeafa
Output:78563412faeacdab
Convert the string to bytes, unpack big-endian 32-bit and pack little-endian 32-bit (or vice versa) and convert back to a string:
#!python3
import binascii
import struct
Input = b'12345678abcdeafa'
Output = b'78563412faeacdab'
def convert(s):
s = binascii.unhexlify(s)
a,b = struct.unpack('>LL',s)
s = struct.pack('<LL',a,b)
return binascii.hexlify(s)
print(convert(Input),Output)
Output:
b'78563412faeacdab' b'78563412faeacdab'
Generalized for any string with length multiple of 4:
import binascii
import struct
Input = b'12345678abcdeafa'
Output = b'78563412faeacdab'
def convert(s):
if len(s) % 4 != 0:
raise ValueError('string length not multiple of 4')
s = binascii.unhexlify(s)
f = '{}L'.format(len(s)//4)
dw = struct.unpack('>'+f,s)
s = struct.pack('<'+f,*dw)
return binascii.hexlify(s)
print(convert(Input),Output)
If they really are strings, just do string operations on them?
>>> input = "12345678abcdeafa"
>>> input[7::-1]+input[:7:-1]
'87654321afaedcba'
My take:
slice the string in N digit chunks
reverse each chunk
concatenate everything
Example:
>>> source = '12345678abcdeafa87654321afaedcba'
>>> # small helper to slice the input in 8 digit chunks
>>> chunks = lambda iterable, sz: [iterable[i:i+sz]
for i in range(0, len(iterable), sz)]
>>> swap = lambda source, sz: ''.join([chunk[::-1]
for chunk in chunks(source, sz)])
Output asked in the original question:
>>> swap(source, 8)
'87654321afaedcba12345678abcdeafa'
It is easy to adapt in order to match the required output after icktoofay edit:
>>> swap(swap(source, 8), 2)
'78563412faeacdab21436587badcaeaf'
A proper implementation probably should check if len(source) % 8 == 0.
I want to build a small formatter in python giving me back the numeric
values embedded in lines of hex strings.
It is a central part of my formatter and should be reasonable fast to
format more than 100 lines/sec (each line about ~100 chars).
The code below should give an example where I'm currently blocked.
'data_string_in_orig' shows the given input format. It has to be
byte swapped for each word. The swap from 'data_string_in_orig' to
'data_string_in_swapped' is needed. In the end I need the structure
access as shown. The expected result is within the comment.
Thanks in advance
Wolfgang R
#!/usr/bin/python
import binascii
import struct
## 'uint32 double'
data_string_in_orig = 'b62e000052e366667a66408d'
data_string_in_swapped = '2eb60000e3526666667a8d40'
print data_string_in_orig
packed_data = binascii.unhexlify(data_string_in_swapped)
s = struct.Struct('<Id')
unpacked_data = s.unpack_from(packed_data, 0)
print 'Unpacked Values:', unpacked_data
## Unpacked Values: (46638, 943.29999999943209)
exit(0)
array.arrays have a byteswap method:
import binascii
import struct
import array
x = binascii.unhexlify('b62e000052e366667a66408d')
y = array.array('h', x)
y.byteswap()
s = struct.Struct('<Id')
print(s.unpack_from(y))
# (46638, 943.2999999994321)
The h in array.array('h', x) was chosen because it tells array.array to regard the data in x as an array of 2-byte shorts. The important thing is that each item be regarded as being 2-bytes long. H, which signifies 2-byte unsigned short, works just as well.
This should do exactly what unutbu's version does, but might be slightly easier to follow for some...
from binascii import unhexlify
from struct import pack, unpack
orig = unhexlify('b62e000052e366667a66408d')
swapped = pack('<6h', *unpack('>6h', orig))
print unpack('<Id', swapped)
# (46638, 943.2999999994321)
Basically, unpack 6 shorts big-endian, repack as 6 shorts little-endian.
Again, same thing that unutbu's code does, and you should use his.
edit Just realized I get to use my favorite Python idiom for this... Don't do this either:
orig = 'b62e000052e366667a66408d'
swap =''.join(sum([(c,d,a,b) for a,b,c,d in zip(*[iter(orig)]*4)], ()))
# '2eb60000e3526666667a8d40'
The swap from 'data_string_in_orig' to 'data_string_in_swapped' may also be done with comprehensions without using any imports:
>>> d = 'b62e000052e366667a66408d'
>>> "".join([m[2:4]+m[0:2] for m in [d[i:i+4] for i in range(0,len(d),4)]])
'2eb60000e3526666667a8d40'
The comprehension works for swapping byte order in hex strings representing 16-bit words. Modifying it for a different word-length is trivial. We can make a general hex digit order swap function also:
def swap_order(d, wsz=4, gsz=2 ):
return "".join(["".join([m[i:i+gsz] for i in range(wsz-gsz,-gsz,-gsz)]) for m in [d[i:i+wsz] for i in range(0,len(d),wsz)]])
The input params are:
d : the input hex string
wsz: the word-size in nibbles (e.g for 16-bit words wsz=4, for 32-bit words wsz=8)
gsz: the number of nibbles which stay together (e.g for reordering bytes gsz=2, for reordering 16-bit words gsz = 4)
import binascii, tkinter, array
from tkinter import *
infile_read = filedialog.askopenfilename()
with open(infile, 'rb') as infile_:
infile_read = infile_.read()
x = (infile_read)
y = array.array('l', x)
y.byteswap()
swapped = (binascii.hexlify(y))
This is a 32 bit unsigned short swap i achieved with code very much the same as "unutbu's" answer just a little bit easier to understand. And technically binascii is not needed for the swap. Only array.byteswap is needed.
I have a UUID that I was thinking of packing into a struct using UUID.int, which turns it into a 128-bit integer. But none of the struct format characters are large enough to store it, how to go about doing this?
Sample code:
s = struct.Struct('L')
unique_id = uuid.uuid4()
tuple = (unique_id.int)
packed = s.pack(*tuple)
The problem is, struct format 'L' is only 4 bytes...I need to store 16. Storing it as a 32-char string is a bit much.
It is a 128-bit integer, what would you expect it to be turned into? You can split it into several components — e.g. two 64-bit integers:
max_int64 = 0xFFFFFFFFFFFFFFFF
packed = struct.pack('>QQ', (u.int >> 64) & max_int64, u.int & max_int64)
# unpack
a, b = struct.unpack('>QQ', packed)
unpacked = (a << 64) | b
assert u.int == unpacked
As you're using uuid module, you can simply use bytes member, which holds UUID as a 16-byte string (containing the six integer fields in big-endian byte order):
u = uuid.uuid4()
packed = u.bytes # packed is a string of size 16
assert u == uuid.UUID(bytes=packed)
TL;DR
struct.pack('<QBBHHL', *uuid_foo.fields[::-1])
Introduction
Even though Cat++'s answer is really great, it breaks the UUID in half to fit it into two unsigned long longs. I wanted to pack each field, which left me with the following:
def maxsize(size: typing.Union[int,str]):
""" Useful for playing with different struct.pack formats """
if isinstance(size, str):
size = struct.calcsize(size)
return 2 ** (4 * size) - 1
uuid_max = uuid.UUID('ffffffff-ffff-ffff-ffff-ffffffffffff')
tuple(maxsize(len(f)) for f in str(u).split('-'))
# (4294967295, 65535, 65535, 65535, 281474976710655)
uuid_max.fields
# (4294967295, 65535, 65535, 255, 255, 281474976710655)
uuid_foo = UUID('909822be-c5c4-432f-95db-da1be79cf067')
uuid_foo.fields
# (2425889470, 50628, 17199, 149, 219, 239813384794215)
The first five fields are easy since they already line up as unsigned 8, 4, 4, 2, 2 size integers. The last one required a little extra help from another answer.
Notes:
Padding is only automatically added between successive structure members. No padding is added at the beginning or the end of the encoded struct.
No padding is added when using non-native size and alignment, e.g. with ‘<’, ‘>’, ‘=’, and ‘!’.
To align the end of a structure to the alignment requirement of a particular type, end the format with the code for that type with a repeat count of zero. See Examples.
struct.pack('>LHHBBQ', *uuid_foo.fields)
# b'\x90\x98"\xbe\xc5\xc4C/\x95\xdb\x00\x00\xda\x1b\xe7\x9c\xf0g'
# ^^ ^^ these empty bytes won't work!
The actual answer
Since the last field is size 12, you'll have to pack it and unpack it backwards, little endian. That'll leave zeros at the end, instead of between the fifth and sixth fields.
struct.unpack('<QBBHHL', struct.pack('<QBBHHL', *uuid_foo.fields[::-1]))
# (281474976710655, 255, 255, 65535, 65535, 4294967295)
uuid_foo.fields
# (4294967295, 65535, 65535, 255, 255, 281474976710655)
Regenerating this requires you reverse it one more time.
uuid_packed = struct.pack('<QBBHHL', *uuid_foo.fields[::-1])
uuid_unpacked = struct.unpack('<QBBHHL', uuid_packed)[::-1]
uuid.UUID(fields=uuid_unpacked)
# UUID('909822be-c5c4-432f-95db-da1be79cf067')