Packing unaligned bytes

Packing unaligned bytes - python

I have a Perl script that creates a binary input file that we pass to another set of code. The binary input file consists of hundreds of parameters that are of various lengths. Most are either 8, 16, or 32 bits. I'm trying to convert the Perl script over to Python, and what is tripping me up are the few parameters that are 24 bits long.
I looked at this forum post, it was close, but not quite what I need.
For example. Lets say the input value is an integer (10187013). How do I pack that down to 3 bytes? If I do something like this:
hexVars = struct.pack("<L", 10187013)
And then write it out to the binary file:
binout = open(binFile, "wb")
binout.write(hexVars)
It, as expected, prints out four bytes 05 71 9b 00, what I want is 05 71 9b. Can I force it to pack only 3 bytes? Or somehow remove the last byte before writing it out?

Packing into an L always gives you 4 bytes -- because that is the meaning of L. Use 3 separate variables (each one 1 byte), or, since you are converting to a string anyway, just lop off the fourth, unused, byte:
import struct
hexVars = struct.pack("<L", 10187013)[:3]
print (len(hexVars))
print (ord(hexVars[0]),ord(hexVars[1]),ord(hexVars[2]))
binout = open('binFile', "wb")
binout.write(hexVars)
Contents of binFile is as expected:
(Tested; this code works with Python 2.7 as well as 3.6.)

Related

Passing a sequence of bits to a file python

As a part of a bigger project, I want to save a sequence of bits in a file so that the file is as small as possible. I'm not talking about compression, I want to save the sequence as it is but using the least amount of characters. The initial idea was to turn mini-sequences of 8 bits into chars using ASCII encoding and saving those chars, but due to some unknown problem with strange characters, the characters retrieved when reading the file are not the same that were originally written. I've tried opening the file with utf-8 encoding, latin-1 but none seems to work. I'm wondering if there's any other way, maybe by turning the sequence into a hexadecimal number?

technically you can not write less than a byte because the os organizes memory in bytes (write individual bits to a file in python), so this is binary file io, see https://docs.python.org/2/library/io.html there are modules like struct
open the file with the 'b' switch, indicates binary read/write operation, then use i.e. the to_bytes() function (Writing bits to a binary file) or struct.pack() (How to write individual bits to a text file in python?)
with open('somefile.bin', 'wb') as f:
import struct
>>> struct.pack("h", 824)
'8\x03'
>>> bits = "10111111111111111011110"
>>> int(bits[::-1], 2).to_bytes(4, 'little')
b'\xfd\xff=\x00'
if you want to get around the 8 bit (byte) structure of the memory you can use bit manipulation and techniques like bitmasks and BitArrays
see https://wiki.python.org/moin/BitManipulation and https://wiki.python.org/moin/BitArrays
however the problem is, as you said, to read back the data if you use BitArrays of differing length i.e. to store a decimal 7 you need 3 bit 0x111 to store a decimal 2 you need 2 bit 0x10. now the problem is to read this back.
how can your program know if it has to read the value back as a 3 bit value or as a 2 bit value ? in unorganized memory the sequence decimal 72 looks like 11110 that translates to 111|10 so how can your program know where the | is ?
in normal byte ordered memory decimal 72 is 0000011100000010 -> 00000111|00000010 this has the advantage that it is clear where the | is
this is why memory on its lowest level is organized in fixed clusters of 8 bit = 1 byte. if you want to access single bits inside a bytes/ 8 bit clusters you can use bitmasks in combination with logic operators (http://www.learncpp.com/cpp-tutorial/3-8a-bit-flags-and-bit-masks/). in python the easiest way for single bit manipulation is the module ctypes
if you know that your values are all 6 bit maybe it is worth the effort, however this is also tough...
(How do you set, clear, and toggle a single bit?)
(Why can't you do bitwise operations on pointer in C, and is there a way around this?)

Writing binary data to a file in Python

I am trying to write data (text, floating point data) to a file in binary, which is to be read by another program later. The problem is that this program (in Fort95) is incredibly particular; each byte has to be in exactly the right place in order for the file to be read correctly. I've tried using Bytes objects and .encode() to write, but haven't had much luck (I can tell from the file size that it is writing extra bytes of data). Some code I've tried:
mgcnmbr='42'
bts=bytes(mgcnmbr)
test_file=open(PATH_HERE/test_file.dat','ab')
test_file.write(bts)
test_file.close()
I've also tried:
mgcnmbr='42'
bts=mgcnmbr.encode(utf_32_le)
test_file=open(PATH_HERE/test_file.dat','ab')
test_file.write(bts)
test_file.close()
To clarify, what I need is the integer value 42, written as a 4 byte binary. Next, I would write the numbers 1 and 0 in 4 byte binary. At that point, I should have exactly 12 bytes. Each is a 4 byte signed integer, written in binary. I'm pretty new to Python, and can't seem to get it to work out. Any suggestions? Soemthing like this? I need complete control over how many bytes each integer (and later, 4 byte floating point ) is.
Thanks

You need the struct module.
import struct
fout = open('test.dat', 'wb')
fout.write(struct.pack('>i', 42))
fout.write(struct.pack('>f', 2.71828182846))
fout.close()
The first argument in struct.pack is the format string.
The first character in the format string dictates the byte order or endianness of the data (Is the most significant or least significant byte stored first - big-endian or little-endian). Endianness varies from system to system. If ">" doesn't work try "<".
The second character in the format string is the data type. Unsurprisingly the "i" stands for integer and the "f" stands for float. The number of bytes is determined by the type. Shorts or "h's" for example are two bytes long. There are also codes for unsigned types. "H" corresponds to an unsigned short for instance.
The second argument in struct.pack is of course the value to be packed into the bytes object.
Here's the part where I tell you that I lied about a couple of things. First I said that the number of bytes is determined by the type. This is only partially true. The size of a given type is technically platform dependent as the C/C++ standard (which the struct module is based on) merely specifies minimum sizes. This leads me to the second lie. The first character in the format string also encodes whether the standard (minimum) number of bytes or the native (platform dependent) number of bytes is to be used. (Both ">" and "<" guarantee that the standard, minimum number of bytes is used which is in fact four in the case of an integer "i" or float "f".) It additionally encodes the alignment of the data.
The documentation on the struct module has tables for the format string parameters.
You can also pack multiple primitives into a single bytes object and realize the same result.
import struct
fout = open('test.dat', 'wb')
fout.write(struct.pack('>if', 42, 2.71828182846))
fout.close()
And you can of course parse binary data with struct.unpack.

Assuming that you want it in little-endian, you could do something like this to write 42 in a four byte binary.
test_file=open(PATH_HERE/test_file.dat','ab')
test_file.write(b'\xA2\0\0\0')
test_file.close()
A2 is 42 in hexadecimal, and the bytes '\xA2\0\0\0' makes the first byte equal to 42 followed by three empty bytes. This code writes the byte: 42, 0, 0, 0.
Your code writes the bytes to represent the character '4' in UTF 32 and the bytes to represent 2 in UTF 32. This means it writes the bytes: 52, 0, 0, 0, 50, 0, 0, 0, because each character is four bytes when encoded in UTF 32.
Also having a hex editor for debugging could be useful for you, then you could see the bytes that your program is outputting and not just the size.

In my problem Write binary string in binary file Python 3.4 I do like this:
file.write(bytes(chr(int(mgcnmbr)), 'iso8859-1'))

Start at offset from beginning of file and unpack 4 byte word

With Python, given an offset say 250 bytes, how would I jump to this position in a file and store a 32bit binary value?
My issue is read() returns a string and I'm not sure if I'm able to properly advance to the valid offset having done that. Also, experimenting with struct.unpack() it's demanding a length equivalent to the specified format. How do I grab only the immediate following data according to what's expected of the specified format? And what's the format for a 32bit int?
Ex. I wrote a string >32 characters and thought I could grab the initial 32 bits and store them as a single 32 bit int by using '<qqqq', this was incorrect needless to say.

with open("input.bin","rb") as f:
f.seek(250) #offset
print struct.unpack("<l",f.read(4)) #grabs one little endian 32 bit long
if you wanted 4, 32 bit ints you would use
print struct.unpack("<llll",f.read(16))
if you just want to grab the next 32bit int
print struct.unpack_from("<l",f)[0]

Appending data bytes to binary file with Python

I want to append a crc that I calculate to an existing binary file.
For example, the crc is 0x55667788.
I want to append 0x55, 0x66, 0x77 and 0x88 to the end of the file.
For example, if I open the file in HexEdit, the last four bytes of the file
will show 0x55667788.
Here is my code so far:
fileopen = askopenfilename()
filename = open(fileopen, 'rb+')
filedata = filename.read()
filecrc32 = hex(binascii.crc32(filedata))
filename.seek(0,2)
filename.write(filecrc32)
filename.close()
I get the following error:
File "C:\Users\cjackel\openfile.py", line 9, in <module>
filename.write(filecrc32)
TypeError: 'str' does not support the buffer interface
Any suggestions?

The hex function returns a string. In this case, you've got a string of 10 hex characters representing your 4-byte number, like this:
'0x55667788'
In Python 2.x, you would be allowed to write this incorrect data to a binary file (it would show up as the 10 bytes 30 78 35 35 36 36 37 37 38 38 rather than the four bytes you want, 55 66 77 88). Python 3.x is smarter, and only allows you to write bytes (or bytearray or similar) to a binary file, not str.
What you want here is not the hex string, but the actual bytes.
The way you described the bytes you want is called big-endian order. On most computers, the "native" order is the opposite, little-endian, which would give you 0x88776655 instead of 0x55667788.
In Python 3.2+, the simplest way to get that is the int.to_bytes method:
filecrc = binascii.crc32(filedata).to_bytes(4, byteorder='big', signed=False)
(The signed=False isn't really necessary, because it's the default, but it's a good way of making it explicit that you're definitely dealing with an unsigned 32-bit integer.)
If you're stuck with earlier versions, you can use the struct module:
filecrc = struct.pack('>I', binascii.crc32(filedata))
The > means big-endian, and the I means unsigned 4-byte integer. So, this returns the same thing. In either case, what you get is b'\x55\x66\x77\x88' (or, as Python would repr it, b'\Ufw\x88').
The error is a bit cryptic, because no novice is going to have any idea what "the buffer interface" is (especially since the 3.x documentation calls it the Buffer Protocol, and it's only documented as part of CPython's C extension API…), but effectively it means that you need a bytes-like object. Usually, this error will mean that you just forgot to encode your string to UTF-8 or some other encoding. But when you were trying to write actual binary data rather than encoded text, it's the same error.

You need to serialize the data. Serialization is the process of getting
the relevant bytes from the whole number. In your case, your CRC is a
4-byte number. The individual 4 bytes can be retrieved to a list as below:
serialized_crc = [(filecrc32 >> 24) & 0xFF,(filecrc32 >> 16) & 0xFF,
(filecrc32 >> 8) & 0xFF,filecrc32 & 0xFF]
The CRC can be then written to the file by converting to a bytearray as below:
filename.write(bytearray(serialized_crc))

Python, how to put 32-bit integer into byte array

I usually perform things like this in C++, but I'm using python to write a quick script and I've run into a wall.
If I have a binary list (or whatever python stores the result of an "fread" in). I can access the individual bytes in it with: buffer[0], buffer[1], etc.
I need to change the bytes [8-11] to hold a new 32-bit file-size (read: there's already a filesize there, I need to update it). In C++ I would just get a pointer to the location and cast it to store the integer, but with python I suddenly realized I have no idea how to do something like this.
How can I update 4 bytes in my buffer at a specific location to hold the value of an integer in python?
EDIT
I'm going to add more because I can't seem to figure it out from the solutions (though I can see they're on the right track).
First of all, I'm on python 2.4 (and can't upgrade, big corporation servers) - so that apparently limits my options. Sorry for not mentioning that earlier, I wasn't aware it had so many less features.
Secondly, let's make this ultra-simple.
Lets say I have a binary file named 'myfile.binary' with the five-byte contents '4C53535353' in hex - this equates to the ascii representations for letters "L and 4xS" being alone in the file.
If I do:
f = open('myfile.binary', 'rb')
contents = f.read(5)
contents should (from Sven Marnach's answer) hold a five-byte immutable string.
Using Python 2.4 facilities only, how could I change the 4 S's held in 'contents' to an arbitrary integer value? I.e. give me a line of code that can make byte indices contents [1-4] contain the 32-bit integer 'myint' with value 12345678910.

What you need is this function:
struct.pack_into(fmt, buffer, offset, v1, v2, ...)
It's documented at http://docs.python.org/library/struct.html near the top.
Example code:
import struct
import ctypes
data=ctypes.create_string_buffer(10)
struct.pack_into(">i", data, 5, 0x12345678)
print list(data)
Similar posting: Python: How to pack different types of data into a string buffer using struct.pack_into
EDIT: Added a Python 2.4 compatible example:
import struct
f=open('myfile.binary', 'rb')
contents=f.read(5)
data=list(contents)
data[0:4]=struct.pack(">i", 0x12345678)
print data

Have a look at Struct module. You need pack function.
EDIT:
The code:
import struct
s = "LSSSS" # your string
s = s[0] + struct.pack('<I', 1234567891) # note "shorter" constant than in your example
print s
Output:
L╙☻ЦI
struct.pack should be available in Python2.4.
Your number "12345678910" cannot be packed into 4 bytes, I shortened it a bit.

The result of file.read() is a string in Python, and it is immutable. Depending on the context of the task you are trying to accomplish, there are different solutions to the problem.
One is using the array module: Read the file directly as an array of 32-bit integers. You can modify this array and write it back to the file.
with open("filename") as f:
f.seek(0, 2)
size = f.tell()
f.seek(0)
data = array.array("i")
assert data.itemsize == 4
data.fromfile(f, size // 4)
data[2] = new_value
# use data.tofile(g) to write the data back to a new file g

You could install the numpy module, which is often used for scientific computing.
read_data = numpy.fromfile(file=id, dtype=numpy.uint32)
Then access the data at the desired location and make your changes.

The following is just a demonstration for you to understand what really happens when the four bytes are converted into an integer.
Suppose you have a number: 15213
Decimal: 15213
Binary: 0011 1011 0110 1101
Hex: 3 B 6 D
On little-endian systems (i.e x86 machines), this number can be represented using a length-4 bytearray as: b"\x6d\x3b\x00\x00" or b"m;\x00\x00" when you print it on the screen, to convert the four bytes into an integer, we simply do a bit of base conversion, which in this case, is:
sum(n*(256**i) for i,n in enumerate(b"\x6d\x3b\x00\x00"))
This gives you the result: 15213

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.