Appending data bytes to binary file with Python - python

I want to append a crc that I calculate to an existing binary file.
For example, the crc is 0x55667788.
I want to append 0x55, 0x66, 0x77 and 0x88 to the end of the file.
For example, if I open the file in HexEdit, the last four bytes of the file
will show 0x55667788.
Here is my code so far:
fileopen = askopenfilename()
filename = open(fileopen, 'rb+')
filedata = filename.read()
filecrc32 = hex(binascii.crc32(filedata))
filename.seek(0,2)
filename.write(filecrc32)
filename.close()
I get the following error:
File "C:\Users\cjackel\openfile.py", line 9, in <module>
filename.write(filecrc32)
TypeError: 'str' does not support the buffer interface
Any suggestions?

The hex function returns a string. In this case, you've got a 10-character string (the 0x prefix plus 8 hex digits) representing your 4-byte number, like this:
'0x55667788'
In Python 2.x, you would be allowed to write this incorrect data to a binary file (it would show up as the 10 bytes 30 78 35 35 36 36 37 37 38 38 rather than the four bytes you want, 55 66 77 88). Python 3.x is smarter, and only allows you to write bytes (or bytearray or similar) to a binary file, not str.
What you want here is not the hex string, but the actual bytes.
The way you described the bytes you want is called big-endian order. On most computers, the "native" order is the opposite, little-endian, which would give you 0x88776655 instead of 0x55667788.
In Python 3.2+, the simplest way to get that is the int.to_bytes method:
filecrc = binascii.crc32(filedata).to_bytes(4, byteorder='big', signed=False)
(The signed=False isn't really necessary, because it's the default, but it's a good way of making it explicit that you're definitely dealing with an unsigned 32-bit integer.)
If you're stuck with earlier versions, you can use the struct module:
filecrc = struct.pack('>I', binascii.crc32(filedata))
The > means big-endian, and the I means unsigned 4-byte integer. So, this returns the same thing. In either case, what you get is b'\x55\x66\x77\x88' (or, as Python would repr it, b'Ufw\x88').
The error is a bit cryptic, because no novice is going to have any idea what "the buffer interface" is (especially since the 3.x documentation calls it the Buffer Protocol, and it's only documented as part of CPython's C extension API…), but effectively it means that you need a bytes-like object. Usually, this error will mean that you just forgot to encode your string to UTF-8 or some other encoding. But when you were trying to write actual binary data rather than encoded text, it's the same error.
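Putting the pieces above together, the whole append step might look roughly like the sketch below. The helper name append_crc32 is made up for illustration, and the code assumes Python 3.2+ (for int.to_bytes) and a file small enough to read into memory in one go.

```python
import binascii


def append_crc32(path):
    """Append the CRC-32 of the file's current contents as 4 big-endian bytes.

    Hypothetical helper for illustration; returns the CRC as an int.
    """
    with open(path, 'rb+') as f:
        data = f.read()
        crc = binascii.crc32(data)  # unsigned int on Python 3
        f.seek(0, 2)                # move to the end of the file before writing
        f.write(crc.to_bytes(4, byteorder='big'))
    return crc
```

Calling append_crc32(some_path) once grows the file by exactly four bytes; calling it a second time would checksum the already-extended file, so it is meant to be run once per file.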

You need to serialize the data. Serialization here means extracting the individual bytes from the integer. In your case, the CRC is a 4-byte number, so keep it as the integer returned by binascii.crc32 (do not pass it through hex). The 4 bytes can then be collected into a list, most significant byte first, as below:
serialized_crc = [(filecrc32 >> 24) & 0xFF, (filecrc32 >> 16) & 0xFF,
                  (filecrc32 >> 8) & 0xFF, filecrc32 & 0xFF]
The CRC can then be written to the file by converting the list to a bytearray as below:
filename.write(bytearray(serialized_crc))
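As a quick sanity check of the shift-and-mask approach, the sketch below runs it on the example value from the question and compares the result with struct.pack. The variable names are illustrative only.

```python
import struct

crc = 0x55667788  # example value from the question

# Most significant byte first (big-endian), one shift-and-mask per byte.
serialized_crc = [(crc >> shift) & 0xFF for shift in (24, 16, 8, 0)]
as_bytes = bytes(bytearray(serialized_crc))

# '>I' is a big-endian unsigned 32-bit integer, so the bytes should match.
same_as_struct = as_bytes == struct.pack('>I', crc)
print(serialized_crc, same_as_struct)
```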

Related

Writing bits (from a bit string) to create a binary file in python

I am new to python. Here is what I am struggling to implement.
I have a very long bit string "1010101010101011111010101010111100001101010101011 ... ".
I want to write this as bits and create a binary file using python.
(Later I want to disassemble this using IDA, but this is not important for this question).
Is there any way I can write to a file at bit level (as binary)? Or do I have to convert it to bytes first and then write byte by byte? What is the best approach?
Yes, you have to convert it to bytes first and then write those bytes to a file. Working on a per byte basis is probably also the best idea to keep control over the ordering of your bytes (big vs. little endian), etc.
You can use int("10101110", 2) to easily convert a bit string to a numeric value. Then use a bytearray to create a sequence of all your byte values. The result will look something like this:
s = "1010101010101011111010101010111100001101010101011"
i = 0
buffer = bytearray()
while i < len(s):
    buffer.append( int(s[i:i+8], 2) )
    i += 8
# now write your buffer to a file
with open(my_file, 'wb') as f:
    f.write(buffer)
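One detail the loop above leaves open is a bit string whose length is not a multiple of 8: int(s[i:i+8], 2) on a short final group treats the missing bits as leading zeros. The sketch below is a variant that instead pads the final group on the right with zero bits; that padding choice and the function name bits_to_bytes are assumptions for illustration, so pick whichever convention your consumer expects.

```python
def bits_to_bytes(bits):
    """Convert a string of '0'/'1' characters to bytes, 8 bits per byte.

    Assumption: a trailing partial group is padded on the right with zeros.
    """
    padded = bits + '0' * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


s = "1010101010101011111010101010111100001101010101011"
data = bits_to_bytes(s)  # ready to pass to f.write() on a file opened with 'wb'
```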

How can I extract mixed binary and ascii values from a bytes string like I did in 2.x?

The following represents a binary image extracted from a file (spaces inserted between bytes to make reading easier). File is opened with 'rb' mode.
01 77 33 9F 41 42 43 44 00 11 11 11
In Python 2.7, I read it as a character string and I use ord() to extract the binary values and then I can extract or even search the string for a specific text value (such as the "ABCD" in characters 4-7). The binary bytes can be anything from 0-FF. I've been putting off conversion to python 3 partly because of this.
I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
I don't write the file, I just use it, so changing it is not an option.
I've seen lots of examples of using b' and other things to convert fixed strings but I need a way to intermix these values, extracting bytes, 2-byte to 8-byte values as 16-bit to 64-bit words, and extracting/searching for ASCII strings within the larger string.
The byte/character separation in Python 3 seems somewhat inflexible for what I need. I'm sure there's a way to do this I just haven't found an example or an answered question that seems to cover this case.
This is a simplified example, I can't provide real data (it's proprietary) but this illustrates the problem. The real files may be short (<1K) or huge (>100K), containing multiple records of different sizes.
Is there an easy, straightforward way to essentially replicate the functionality I have in Python 2.7?
This is on Windows.
Thanks
I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
Read the file in binary mode, as you are doing. This produces a bytes object, which in 3.x is not the same as a str (as it would be in 2.x).
Interpret the bytes as bytes, as needed, to figure out the general structure of the data. Slicing the bytes produces another bytes as before; indexing produces an int with the numeric value of that single byte (not as before) - no ord required.
When you have determined a subset of the bytes that represent a string (let's say for convenience that you have sliced it out), convert to string using the appropriate encoding: e.g. str(my_bytes, 'ascii'). Note that ASCII will not handle byte values 0x80 through 0xFF; especially with binary-ish legacy file formats, there's a good chance your data is actually something like Latin-1: str(my_bytes, 'iso-8859-1').
search the string for a specific text value
You can search at either the text or the byte level - bytes objects support the in operator, searching for either a subsequence of bytes or a single integer value. Whether it makes more sense to search before or after string conversion will depend on what you are doing.
using b' and other things to convert fixed strings
b'' is just the syntax for a literal bytes object. It's what you'll see if you ask for the repr of what you read from the file. Prefixing a b onto an existing string literal in your code isn't really "converting" anything, but replacing it with the value you should have had in the first place.
2-byte to 8-byte values as 16-bit to 64-bit words
The documentation says it at least as well as I could:
>>> help(int.from_bytes)
Help on built-in function from_bytes:
from_bytes(...) method of builtins.type instance
int.from_bytes(bytes, byteorder, *, signed=False) -> int
Return the integer represented by the given array of bytes.
The bytes argument must be a bytes-like object (e.g. bytes or bytearray).
The byteorder argument determines the byte order used to represent the
integer. If byteorder is 'big', the most significant byte is at the
beginning of the byte array. If byteorder is 'little', the most
significant byte is at the end of the byte array. To request the native
byte order of the host system, use `sys.byteorder' as the byte order value.
The signed keyword-only argument indicates whether two's complement is
used to represent the integer.
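Applied to the sample bytes from the question, the steps described above could look like the sketch below. The field meanings (a record length in byte 2, a 16-bit word at offset 0, a 32-bit word at offset 8) and the byte orders are invented for illustration, since the real format is not given.

```python
data = bytes.fromhex('01 77 33 9F 41 42 43 44 00 11 11 11')

first = data[0]                    # indexing gives an int, no ord() needed
length = data[2]                   # 0x33, e.g. a record length
text = data[4:8].decode('ascii')   # slicing gives bytes; decode to get a str
offset = data.find(b'ABCD')        # search at the byte level
has_text = b'ABCD' in data         # the in operator works on bytes too

# Multi-byte words: pick the byte order the file format actually uses.
word16 = int.from_bytes(data[0:2], byteorder='little')
word32 = int.from_bytes(data[8:12], byteorder='big')
```

The struct module (struct.unpack_from) is an alternative when several fixed-size fields sit next to each other.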

Packing unaligned bytes

I have a Perl script that creates a binary input file that we pass to another set of code. The binary input file consists of hundreds of parameters that are of various lengths. Most are either 8, 16, or 32 bits. I'm trying to convert the Perl script over to Python, and what is tripping me up are the few parameters that are 24 bits long.
I looked at this forum post, it was close, but not quite what I need.
For example. Lets say the input value is an integer (10187013). How do I pack that down to 3 bytes? If I do something like this:
hexVars = struct.pack("<L", 10187013)
And then write it out to the binary file:
binout = open(binFile, "wb")
binout.write(hexVars)
It, as expected, prints out four bytes 05 71 9b 00, what I want is 05 71 9b. Can I force it to pack only 3 bytes? Or somehow remove the last byte before writing it out?
Packing into an L always gives you 4 bytes -- because that is the meaning of L. Use 3 separate variables (each one 1 byte), or, since you are converting to a string anyway, just lop off the fourth, unused, byte:
import struct
hexVars = struct.pack("<L", 10187013)[:3]
print(len(hexVars))
# bytearray yields ints on both Python 2 and 3 (ord() on an indexed bytes object fails on 3)
print(list(bytearray(hexVars)))
with open('binFile', "wb") as binout:
    binout.write(hexVars)
Contents of binFile are as expected: 05 71 9b
(This code runs on Python 2.7 as well as 3.6.)
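On Python 3 only, another option is to skip struct for the odd-sized fields and ask int.to_bytes for exactly three bytes. This is a sketch of that alternative, not part of the answer above; the variable names are illustrative.

```python
value = 10187013  # 0x9B7105, the example value from the question

# Python 3 only: request exactly 3 bytes, least significant byte first.
packed = value.to_bytes(3, byteorder='little')

# Reading it back: int.from_bytes reverses the conversion.
restored = int.from_bytes(packed, byteorder='little')
```

to_bytes raises OverflowError if the value does not fit in the requested number of bytes, whereas slicing a 4-byte pack silently drops the high byte.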

Writing hex value into file Python

What I am really doing is creating a BMP file from JPEG using python and it's got some header data which contains info like size, height or width of the image, so basically I want to read a JPEG file, gets it width and height, calculate the new size of a BMP file and store it in the header.
Let's say the new size of the BMP file is 40000 bytes whose hex value is 0x9c40, now as there is 4 byte space to save this in the header, we can write it as 0x00009c40. In BMP header data, LSB is written first and then MSB so I have to write, 0x409c0000 in the file.
My Problems:-
I was able to do this in C but I am totally lost how to do so in Python.
For example, if I have i=40000, I can get the hex value with str=hex(i)[2:], and with some more code I was able to add the extra zeros and reverse the byte order. Now how do I write this '409c0000' data to the file as hex?
The header size is 54 bytes for a BMP file, so is there another way to just store the data in a string like str='00ffcf4f...' (up to 54 bytes), convert the whole str at once as hex, and write it to the file?
My friend told me to use unhexlify from binascii.
By doing unhexlify('fffcff') I get '\xff\xfc\xff', which is what I want, but when I try unhexlify('3000') I get '0\x00', which is not what I want. It is the same for any value containing 3, 4, 5, 6 or 7. Is it the right way to do this?
You are not writing hex, you are writing binary data. Hexadecimal is a helpful notation when dealing with binary data, but don't confuse the notation with the value.
Use the struct module to pack integer data into binary structures, the same way C would.
binascii.unhexlify also is a good choice, provided you already have the data in a string using hex notation. The output is correct, but the binary representation only uses hex escapes for bytes outside the printable ASCII range.
Thus fffcff does correctly become \xff\xfc\xff, representing 3 bytes in hex escape notation, and 3000 is \x30\x00, but \x30 is the '0' character in ASCII, so the Python representation for that byte simply uses that ASCII character, as that is the most common way to interpret bytes.
Packing the integer value 40000 using struct.pack() as an unsigned integer (little endian) then becomes:
>>> import struct
>>> struct.pack('<I', 40000)
'@\x9c\x00\x00'
where the 40 byte is represented by the ASCII character for that byte, the @ glyph.
If this is confusing, you can always create a new hex representation by going the other way and using the binascii.hexlify() function (https://docs.python.org/2/library/binascii.html#binascii.hexlify) to create a hexadecimal representation for yourself, just to debug the output:
>>> import binascii
>>> binascii.hexlify(struct.pack('<I', 40000))
'409c0000'
and you'll see that the @ byte is still the right hex value.
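For the BMP case in the question, several header fields can be packed in one call. The sketch below assumes the standard 14-byte BMP file header layout (2-byte 'BM' signature, 4-byte file size, two 2-byte reserved fields, 4-byte offset to the pixel data); the remaining 40 bytes of the 54-byte header are not shown, and the variable names are illustrative.

```python
import struct

file_size = 40000    # total size of the BMP file in bytes (example from the question)
pixel_offset = 54    # assumption: pixel data starts right after the 54-byte header

# '<' = little-endian, no padding; 2s = 2-byte string, I = uint32, H = uint16
file_header = struct.pack('<2sIHHI', b'BM', file_size, 0, 0, pixel_offset)
```

The resulting bytes object can be written directly to a file opened in 'wb' mode.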

Writing binary data to a file in Python

I am trying to write data (text, floating point data) to a file in binary, which is to be read by another program later. The problem is that this program (in Fort95) is incredibly particular; each byte has to be in exactly the right place in order for the file to be read correctly. I've tried using Bytes objects and .encode() to write, but haven't had much luck (I can tell from the file size that it is writing extra bytes of data). Some code I've tried:
mgcnmbr='42'
bts=bytes(mgcnmbr)
test_file=open('PATH_HERE/test_file.dat','ab')
test_file.write(bts)
test_file.close()
I've also tried:
mgcnmbr='42'
bts=mgcnmbr.encode('utf_32_le')
test_file=open('PATH_HERE/test_file.dat','ab')
test_file.write(bts)
test_file.close()
To clarify, what I need is the integer value 42, written as a 4 byte binary. Next, I would write the numbers 1 and 0 in 4 byte binary. At that point, I should have exactly 12 bytes. Each is a 4 byte signed integer, written in binary. I'm pretty new to Python, and can't seem to get it to work out. Any suggestions? Something like this? I need complete control over how many bytes each integer (and later, 4 byte floating point) is.
Thanks
You need the struct module.
import struct
fout = open('test.dat', 'wb')
fout.write(struct.pack('>i', 42))
fout.write(struct.pack('>f', 2.71828182846))
fout.close()
The first argument in struct.pack is the format string.
The first character in the format string dictates the byte order or endianness of the data (Is the most significant or least significant byte stored first - big-endian or little-endian). Endianness varies from system to system. If ">" doesn't work try "<".
The second character in the format string is the data type. Unsurprisingly the "i" stands for integer and the "f" stands for float. The number of bytes is determined by the type. Shorts or "h's" for example are two bytes long. There are also codes for unsigned types. "H" corresponds to an unsigned short for instance.
The second argument in struct.pack is of course the value to be packed into the bytes object.
Here's the part where I tell you that I lied about a couple of things. First I said that the number of bytes is determined by the type. This is only partially true. The size of a given type is technically platform dependent as the C/C++ standard (which the struct module is based on) merely specifies minimum sizes. This leads me to the second lie. The first character in the format string also encodes whether the standard (minimum) number of bytes or the native (platform dependent) number of bytes is to be used. (Both ">" and "<" guarantee that the standard, minimum number of bytes is used which is in fact four in the case of an integer "i" or float "f".) It additionally encodes the alignment of the data.
The documentation on the struct module has tables for the format string parameters.
You can also pack multiple primitives into a single bytes object and realize the same result.
import struct
fout = open('test.dat', 'wb')
fout.write(struct.pack('>if', 42, 2.71828182846))
fout.close()
And you can of course parse binary data with struct.unpack.
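For the exact case in the question (the integers 42, 1 and 0 as 4-byte signed values, 12 bytes in total), a round trip through struct.pack and struct.unpack might look like this sketch. Little-endian is an assumption here; use '>' instead of '<' if the reading program expects big-endian.

```python
import struct

# '<' selects little-endian with standard sizes; 'i' is a 4-byte signed int.
record = struct.pack('<3i', 42, 1, 0)

# struct.calcsize reports how many bytes a format string occupies.
size = struct.calcsize('<3i')

# struct.unpack reverses the packing and returns a tuple.
values = struct.unpack('<3i', record)
```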
Assuming that you want it in little-endian, you could do something like this to write 42 in a four byte binary.
test_file=open('PATH_HERE/test_file.dat','ab')
test_file.write(b'\x2A\0\0\0')
test_file.close()
2A is 42 in hexadecimal, and the bytes b'\x2A\0\0\0' make the first byte equal to 42, followed by three zero bytes. This code writes the bytes 42, 0, 0, 0.
Your code writes the bytes that represent the character '4' in UTF-32, followed by the bytes that represent the character '2'. This means it writes the bytes 52, 0, 0, 0, 50, 0, 0, 0, because each character is four bytes when encoded in UTF-32.
Also having a hex editor for debugging could be useful for you, then you could see the bytes that your program is outputting and not just the size.
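Without a hex editor, a quick way to see the difference between encoding the text '42' and packing the integer 42 is to print the byte values from Python itself. This is only a debugging sketch; the variable names are illustrative.

```python
import struct

text_bytes = '42'.encode('utf_32_le')  # encodes the characters '4' and '2'
int_bytes = struct.pack('<i', 42)      # encodes the integer 42

print(list(text_bytes))  # [52, 0, 0, 0, 50, 0, 0, 0]
print(list(int_bytes))   # [42, 0, 0, 0]
```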
In my question "Write binary string in binary file Python 3.4" I did it like this (note that this writes a single byte, not four):
file.write(bytes(chr(int(mgcnmbr)), 'iso8859-1'))
