I have a binary output file from a FORTRAN code. Want to read it in python. (Reading with FORTRAN and outputting text to read for python is not an option. Long story.) I can read the first record in a simplistic manner:
>>> binfile=open('myfile','rb')
>>> pad1=struct.unpack('i',binfile.read(4))[0]
>>> ver=struct.unpack('d',binfile.read(8))[0]
>>> pad2=struct.unpack('i',binfile.read(4))[0]
>>> pad1,ver,pad2
(8,3.13,8)
Just fine. But this is a big file and I need to do this more efficiently. So I try:
>>> (pad1,ver,pad2)=struct.unpack('idi',binfile.read(16))
This won't run. Gives me an error and tells me that unpack needs an argument with a length of 20. This makes no sense to me since the last time I checked, 4+8+4=16. When I give in and replace the 16 with 20, it runs, but the three numbers are populated with numerical junk. Does anyone see what I am doing wrong? Thanks!
The size you get is due to alignment; try struct.calcsize('idi') to verify that the size is actually 20 after alignment. To use the native byte order without alignment, put = at the start of the format: struct.calcsize('=idi') returns 16, and the same '=idi' format works in struct.unpack for your example.
For more info on the struct module, check http://docs.python.org/2/library/struct.html
The struct module is mainly intended to interoperate with C structures and because of this it aligns the data members. idi corresponds to the following C structure:
struct
{
int int1;
double double1;
int int2;
};
double entries require 8-byte alignment in order to function efficiently (or even correctly) with most CPU load operations. That's why 4 bytes of padding are added between int1 and double1, which is how struct arrives at a size of 20 bytes instead of 16. The struct module performs the same padding, unless you suppress it by adding < (for little-endian data) or > (for big-endian data), or simply = (native byte order, no alignment) at the beginning of the format string:
>>> struct.unpack('idi', d)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
struct.error: unpack requires a string argument of length 20
>>> struct.unpack('<idi', d)
(-1345385859, 2038.0682530887993, 428226400)
>>> struct.unpack('=idi', d)
(-1345385859, 2038.0682530887993, 428226400)
(d is a string of 16 random chars.)
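To tie this back to the question, here is a minimal sketch that builds one Fortran-style record in memory and reads it back in a single call. The little-endian layout and the sample values are assumptions for illustration; a real file may need > instead of <.

```python
import struct

# One Fortran-style sequential record built in memory for illustration:
# 4-byte length marker, one 8-byte double, 4-byte length marker.
# Little-endian layout is an assumption; use '>' for big-endian files.
record = struct.pack('<idi', 8, 3.13, 8)

aligned_size = struct.calcsize('idi')   # native alignment, padding included
packed_size = struct.calcsize('<idi')   # standard sizes, no padding

# With padding suppressed, the 16 bytes unpack in one call.
pad1, ver, pad2 = struct.unpack('<idi', record)
print(aligned_size, packed_size, len(record))
print(pad1, ver, pad2)
```

The aligned size depends on the platform, which is one more reason to spell out the byte order explicitly when reading files written by another program.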
I recommend using arrays to read a file that was written by FORTRAN with UNFORMATTED, SEQUENTIAL.
Your specific example using arrays, would be as follows:
import array
binfile=open('myfile','rb')
pad = array.array('i')
ver = array.array('d')
pad.fromfile(binfile,1) # read the length of the record
ver.fromfile(binfile,1) # read the actual data written by FORTRAN
pad.fromfile(binfile,1) # read the length of the record
If you have FORTRAN records that write arrays of integers and doubles, which is very common, your python would look something like this:
import array
binfile=open('myfile','rb')
pad = array.array('i')
my_integers = array.array('i')
my_floats = array.array('d')
number_of_integers = 1000 # replace with how many you need to read
number_of_floats = 10000 # replace with how many you need to read
pad.fromfile(binfile,1) # read the length of the record
my_integers.fromfile(binfile,number_of_integers) # read the integer data
my_floats.fromfile(binfile,number_of_floats) # read the double data
pad.fromfile(binfile,1) # read the length of the record
Final comment is that if you have characters on the file, you can read those into an array as well, and then decode it into a string. Something like this:
import array
binfile=open('myfile','rb')
pad = array.array('i')
my_characters = array.array('B')
number_of_characters = 63 # replace with number of characters to read
pad.fromfile(binfile,1) # read the length of the record
my_characters.fromfile(binfile,number_of_characters ) # read the data
my_string = my_characters.tobytes().decode(encoding='utf_8')
pad.fromfile(binfile,1) # read the length of the record
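If you would rather not hard-code the item counts, the record markers themselves tell you how many bytes each record holds. The following is only a sketch under the assumption that the compiler wrote 4-byte markers in native byte order (some compilers use 8-byte markers); read_fortran_record is a hypothetical helper name, not part of any library.

```python
import io
import struct

def read_fortran_record(f, marker_fmt='=i'):
    """Return the payload bytes of the next record, or None at end of file."""
    marker_size = struct.calcsize(marker_fmt)
    head = f.read(marker_size)
    if not head:
        return None  # clean end of file
    if len(head) < marker_size:
        raise ValueError('truncated leading record marker')
    (nbytes,) = struct.unpack(marker_fmt, head)
    payload = f.read(nbytes)
    tail = f.read(marker_size)
    if len(payload) < nbytes or len(tail) < marker_size:
        raise ValueError('truncated record')
    # The trailing marker repeats the record length; use it as a sanity check.
    if struct.unpack(marker_fmt, tail)[0] != nbytes:
        raise ValueError('leading and trailing record markers differ')
    return payload

# Demo on an in-memory file holding two records: one double, then three ints.
demo = io.BytesIO(
    struct.pack('=idi', 8, 3.13, 8)
    + struct.pack('=i3ii', 12, 1, 2, 3, 12)
)
ver = struct.unpack('=d', read_fortran_record(demo))[0]
ints = struct.unpack('=3i', read_fortran_record(demo))
end = read_fortran_record(demo)
print(ver, ints, end)
```

The payload comes back as raw bytes, so you still decide per record whether to hand it to struct.unpack or to array.array.frombytes.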
I want to read a binary file, get the content four bytes by four bytes and perform int operations on these packets.
Using a dummy binary file, opened this way:
with open('MEM_10001000_0000B000.mem', 'br') as f:
for byte in f.read():
print (hex(byte))
I want to perform an encryption with a 4 byte long key, 0x9485A347 for example.
Is there a simple way I can read my files 4 bytes at a time and get them as int or do I need to put them in a temporary result using a counter?
My original idea is the following:
current_tmp = []
for byte in data:
current_tmp.append(int(byte))
if (len(current_tmp) == 4):
print (current_tmp)
# but current_tmp is an array not a single int
current_tmp = []
In my example, instead of having [132, 4, 240, 215] I would rather have 0x8404f0d7
Just use the size argument of read to read 4 bytes at a time, and the from_bytes constructor of Python 3's int to get it going:
with open('MEM_10001000_0000B000.mem', 'br') as f:
data = f.read(4)
while data:
number = int.from_bytes(data, "big")
...
data = f.read(4)
If you are not using Python 3 yet for some reason, int won't feature a from_bytes method - then you could resort to using the struct module:
import struct
...
number = struct.unpack(">i", data)[0]
...
These methods, however, are good for a couple of iterations and could get slow for a large file - Python offers a way to fill an array of 4-byte integers directly in memory from an open file, which is more likely what you should be using:
import array, os
numbers = array.array("i")
with open('MEM_10001000_0000B000.mem', 'br') as f:
numbers.fromfile(f, os.stat('MEM_10001000_0000B000.mem').st_size // numbers.itemsize)
numbers.byteswap()
Once you have the array, you can xor it with something like
from functools import reduce #not needed in Python2.7
result = reduce(lambda result, input: result ^ input, numbers, key)
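The array-reading code above gives you a numbers sequence holding every value in your file as a 4-byte, big-endian, signed int (the byteswap() call assumes a little-endian machine). Note that the reduce call folds the whole sequence and key into a single XOR value; for a per-element result, XOR each number with key instead.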
If your file is not a multiple of 4 bytes, the first two methods might need some adjustment - fixing the while condition will be enough.
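For the encryption use case in the question, a per-word XOR might look like the sketch below. It assumes Python 3, big-endian words, and that a short final chunk should be XORed with only the leading bytes of the key; xor_with_key is a hypothetical name and the trailing-chunk rule is my assumption, not something the question specifies.

```python
def xor_with_key(data, key=0x9485A347):
    """XOR data with a 4-byte key, one big-endian 4-byte word at a time."""
    out = bytearray()
    for i in range(0, len(data), 4):
        chunk = data[i:i + 4]
        word = int.from_bytes(chunk, "big")
        # Assumption: a short final chunk uses only the leading key bytes.
        partial_key = key >> (8 * (4 - len(chunk)))
        out += (word ^ partial_key).to_bytes(len(chunk), "big")
    return bytes(out)

sample = bytes([132, 4, 240, 215])        # the bytes from the question
as_int = int.from_bytes(sample, "big")    # 0x8404f0d7
encrypted = xor_with_key(sample)
decrypted = xor_with_key(encrypted)       # XOR is its own inverse
print(hex(as_int), encrypted.hex(), decrypted == sample)
```

Because XOR is its own inverse, running the same function twice returns the original bytes, which makes for a quick sanity check.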
I have to read a binary file. So I have totally immersed myself in the python struct module.
However there are still things that confuse me. Let's consider the following chunk of code:
import struct
print struct.pack('5c', *'Hello')
to_pack = (5.9, 14.87, 'HEAD', 32321, 238, 99)
packed = struct.pack('2f4s3i', *to_pack)
print "packed: ", packed
output:
Hello
packed: �̼#��mAHEADA~�
I packed successively 2 floats, a 4 chars string, and three integers.
Then when unpacking:
unpacked = struct.unpack('2f4s3i', packed)
print "unpacked: ", unpacked
Output:
unpacked: (5.900000095367432, 14.869999885559082, 'HEAD', 32321, 238, 99)
So the packing function turned my original data into binary data, whilst unpacking did the
opposite. However, does it mean I necessarily have to know how my data is organized, do I
necessarily have to know which types are encoded, and their respective order?
What if I don't, how could I guess the right type order of my data? For instance if I do:
unpacked = struct.unpack('2f4s3h', packed) # I replaced the 3i with 3h
print "unpacked: ", unpacked
I would get a nice error:
unpacked = struct.unpack('2f4s3h', packed)
struct.error: unpack requires a string argument of length 18
So it seems to me that whatever the binary data I get when reading a binary file, if
I don't know the correct types in the right order, I could not convert it to its original
form.
Is there a way to convert the data back to non-binary without specifying the expected types,
or would I really be stuck with a non-usable binary file?
I mean, even among those creating huge binary files from gigantic ones, how would they
manage to retrieve their data successfully?
For information, my example was taken from this pdf file: https://gebloggendings.files.wordpress.com/2012/07/struct.pdf
Yes, it's raw binary data, so you need to tell Python about its structure in order to usefully unpack it. Python doesn't know whether that 24-byte blob of data you created in packed is 6 floats, or 6 ints, or 3 doubles, or any combination of those, or something completely different.
>>> unpack('6f', packed)
(5.900000095367432, 14.869999885559082, 773.08251953125, 4.5291367665442413e-41, 3.3350903450930646e-43, 1.3872854796815689e-43)
>>> unpack('6i', packed)
(1086115021, 1097722757, 1145128264, 32321, 238, 99)
>>> unpack('3d', packed)
(15686698.023046875, 6.8585591728324e-310, 2.10077583423e-312)
>>> unpack('dfid', packed)
(15686698.023046875, 773.08251953125, 32321, 2.10077583423e-312)
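One thing struct can tell you is which formats are even admissible for a blob of a given length. A minimal sketch, assuming Python 3 and a little-endian layout so that the sizes do not depend on the platform:

```python
import struct

packed = struct.pack('<2f4s3i', 5.9, 14.87, b'HEAD', 32321, 238, 99)

results = {}
for fmt in ('<2f4s3i', '<6i', '<3d', '<2f4s3h'):
    size = struct.calcsize(fmt)
    if size == len(packed):
        # The format fits, but that alone does not make it the right one.
        results[fmt] = struct.unpack(fmt, packed)
    else:
        results[fmt] = 'needs %d bytes, blob has %d' % (size, len(packed))

for fmt, value in results.items():
    print(fmt, value)
```

A matching size only rules formats out; several layouts fit the same 24 bytes, so the real layout still has to come from the file format's documentation or from the code that wrote the file.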
I am aware that there are a lot of almost identical questions, but none seems to really target the general case.
So assume I want to open a file, read it in memory, possibly do some operations on the respective bitstring and write the result back to file.
The following is what seems straightforward to me, but it results in completely different output. Note that for simplicity I only copy the file here:
file = open('INPUT','rb')
data = file.read()
data_16 = data.encode('hex')
data_2 = bin(int(data_16,16))
OUT = open('OUTPUT','wb')
i = 0
while i < len(data_2) / 8:
byte = int(data_2[i*8 : (i+1)*8], 2)
OUT.write('%c' % byte)
i += 1
OUT.close()
I looked at data, data_16 and data_2. The transformations make sense as far as I can see.
As expected, the output file has exactly the same size in bits as the input file.
EDIT: I considered the possibility that the leading '0b' has to be cut. See the following:
>>> data[:100]
'BMFU"\x00\x00\x00\x00\x006\x00\x00\x00(\x00\x00\x00\xe8\x03\x00\x00\xee\x02\x00\x00\x01\x00\x18\x00\x00\x00\x00\x00\x00\x00\x00\x00\x12\x0b\x00\x00\x12\x0b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05=o\xce\xf4^\x16\xe0\x80\x92\x00\x00\x00\x01I\x02\x1d\xb5\x81\xcaN\xcb\xb8\x91\xc3\xc6T\xef\xcb\xe1j\x06\xc3;\x0c*\xb9Q\xbc\xff\xf6\xff\xff\xf7\xed\xdf'
>>> data_16[:100]
'424d46552200000000003600000028000000e8030000ee020000010018000000000000000000120b0000120b000000000000'
>>> data_2[:100]
'0b10000100100110101000110010101010010001000000000000000000000000000000000000000000011011000000000000'
>>> data_2[1]
'b'
Maybe the BMFU" part should be cut from data?
>>> bin(25)
'0b11001'
Note two things:
The "0b" at the beginning. This means that your slicing will be off by 2 bits.
The lack of padding to 8 bits. This will corrupt your data every time unless it happens to mesh up with point 1.
Process the file byte by byte instead of attempting to process it in one big gulp like this. If you find your code too slow then you need to find a faster way of working byte by byte, not switch to an irreparably flawed method such as this one.
You could simply write the data variable back out and you'd have a successful round trip.
But it looks like you intend to work on the file as a string of 0 and 1 characters. Nothing wrong with that (though it's rarely necessary), but your code takes a very roundabout way of converting the data to that form. Instead of building a monster integer and converting it to a bit string, just do so for one byte at a time:
data = file.read()
data_2 = "".join( bin(ord(c))[2:].zfill(8) for c in data )  # zfill pads each byte to 8 bits
data_2 is now a sequence of zeros and ones. (In a single string, same as you have it; but if you'll be making changes, I'd keep the bitstrings in a list). The reverse conversion is also best done byte by byte:
newdata = "".join(chr(int("".join(bits), 2)) for bits in grouper(long_bitstring, 8, "0"))  # join each 8-tuple, parse as base 2
This uses the grouper recipe from the itertools documentation.
from itertools import izip_longest
def grouper(iterable, n, fillvalue=None):
"Collect data into fixed-length chunks or blocks"
# grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
args = [iter(iterable)] * n
return izip_longest(fillvalue=fillvalue, *args)
You can use the struct module to read and write binary data. (Link to the doc here.)
EDIT
Sorry, I was misled by your title. I’ve just understood that you write binary data in a text file instead of writing binary data directly.
Ok, thanks to alexis and being aware of Ignacio's warning about the padding, I found a way to do what I wanted to do, that is read data into a binary representation and write a binary representation to file:
def padd(bitstring):
padding = ''
for i in range(8-len(bitstring)):
padding += '0'
bitstring = padding + bitstring
return bitstring
file = open('INPUT','rb')
data = file.read()
data_2 = "".join( padd(bin(ord(c))[2:]) for c in data )
OUT = open('OUTPUT','wb')
i = 0
while i < len(data_2) / 8:
byte = int(data_2[i*8 : (i+1)*8], 2)
OUT.write('%c' % byte)
i += 1
OUT.close()
If I did not do it exactly the way proposed by alexis then that is because it did not work. Of course this is terribly slow but now that I can do the simplest thing, I can optimize it further.
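For anyone on Python 3, where iterating over bytes already yields ints, the same round trip can be sketched more compactly. The function names are hypothetical, and the sample bytes are just the first few from the dump above.

```python
def bytes_to_bits(data):
    """Return data as a string of '0'/'1' characters, 8 per byte."""
    return "".join(format(b, "08b") for b in data)

def bits_to_bytes(bits):
    """Inverse of bytes_to_bits; expects a multiple of 8 characters."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

sample = b'BMFU"\x00'
bits = bytes_to_bits(sample)
restored = bits_to_bytes(bits)
print(len(bits), restored == sample)
```

format(b, "08b") does the zero padding that padd() does by hand, and it never produces the leading "0b".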
I need to store a binary file with a 12 byte header composed of 4 fields. They are namely: sSamples (4-bytes integer), sSampPeriod (4-bytes integer), sSampSize (2-bytes integer), and finally sParmKind (2-bytes integer).
I'm using 'struct' to pack my variables into the desired fields. Now that I have them defined separately, how could I merge them all to store the '12 bytes header'?
sSamples = struct.pack('i', nSamples) # 4-bytes integer
sSampPeriod = struct.pack('i', nSampPeriod) # 4-bytes integer
sSampSize = struct.pack('H', nSampSize) # 2-bytes integer / unsigned short
sParmKind = struct.pack('H', 9) # 2-bytes integer / unsigned short
In addition, I've a npVect float array of dimensionality D (numpy.ndarray - float32). How could I store this vector in the same binary file, but after the header?
As Cody Brocious wrote, you can pack your entire header at once:
header = struct.pack('<iiHH', nSamples, nSampPeriod, nSampSize, nParmKind)
He also mentioned endianness, which is important if you want to pack your data so as to reliably unpack it on machines with different architectures. The < at the beginning of my format string specifies "pack this data using a little-endian convention".
As for the array, you'll have to pack its length in order to determine how many values to unpack when you read it again. Doing it all in one call:
flattened = npVect.ravel() # get a 1-D array of numbers
arrSize = len(flattened)
# pack header, count of numbers, and numbers, all in one call
packed = struct.pack('<iiHHi%df' % arrSize,
nSamples, nSampPeriod, nSampSize, nParmKind, arrSize, *flattened)
Depending on how big your array is likely to be, you could end up with a huge string representing the entire contents of your binary file, and you might want to look into alternatives to struct which don't require you to have the entire file in memory.
Unpacking:
fmt = '<iiHHi'
# unpack_from reads only the bytes the format needs; unpack would reject the longer string
nSamples, nSampPeriod, nSampSize, nParmKind, arrSize = struct.unpack_from(fmt, packed)
# Use unpack_from to start reading after the packed header and count
flattened = struct.unpack_from('<%df' % arrSize, packed, struct.calcsize(fmt))
npVect = np.array(flattened, dtype='float32').reshape(# your dimensions go here
)
EDIT: Oops, the array format isn't quite as simple as that :) The general idea holds, though: flatten your array into a list of numbers using any method you like, pack the number of values, then pack each value. On the other side, read the array as a flat list, then impose whatever structure you need on it.
EDIT: Changed format strings to use repeat specifiers, rather than string multiplication. Thanks to John Machin for pointing it out.
EDIT: Added numpy code to flatten the array before packing and reconstruct it after unpacking.
struct.pack returns a string, so you can combine the fields simply by string concatenation:
header = sSamples + sSampPeriod + sSampSize + sParmKind
assert len( header ) == 12
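To avoid building one giant format string, the header and the samples can also be written separately. The sketch below uses only the standard library, with array.array('f') standing in for the flattened float32 data; the function names and the sample values are assumptions for illustration. With numpy, npVect.astype('<f4').tobytes() should give the same payload bytes.

```python
import array
import io
import struct
import sys

HEADER = struct.Struct('<iiHH')  # 4 + 4 + 2 + 2 = 12 bytes, little-endian

def write_vector_file(f, n_samples, samp_period, samp_size, parm_kind, values):
    """Write the 12-byte header followed by the values as float32."""
    f.write(HEADER.pack(n_samples, samp_period, samp_size, parm_kind))
    payload = array.array('f', values)
    if sys.byteorder == 'big':
        payload.byteswap()  # keep the file little-endian on any machine
    f.write(payload.tobytes())

def read_vector_file(f):
    """Return (header tuple, list of floats) from a file written as above."""
    header = HEADER.unpack(f.read(HEADER.size))
    payload = array.array('f')
    payload.frombytes(f.read())
    if sys.byteorder == 'big':
        payload.byteswap()
    return header, payload.tolist()

# Round trip through an in-memory file; the values are exact in float32.
buf = io.BytesIO()
write_vector_file(buf, 3, 100000, 4, 9, [1.0, 2.5, -3.25])
buf.seek(0)
header, values = read_vector_file(buf)
print(header, values)
```

Here the reader takes everything after the header as data, so no explicit count is stored; if more sections follow the array, pack the count as in the answer above.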
I have practically no knowledge of Matlab, and need to translate some parsing routines into Python. They are for large files, that are themselves divided into 'blocks', and I'm having difficulty right from the off with the checksum at the top of the file.
What exactly is going on here in Matlab?
status = fseek(fid, 0, 'cof');
fposition = ftell(fid);
disp(' ');
disp(['** Block ',num2str(iBlock),' File Position = ',int2str(fposition)]);
% ----------------- Block Start ------------------ %
[A, count] = fread(fid, 3, 'uint32');
if(count == 3)
magic_l = A(1);
magic_h = A(2);
block_length = A(3);
else
if(fposition == file_length)
disp(['** End of file OK']);
else
disp(['** Cannot read block start magic ! Note File Length = ',num2str(file_length)]);
end
ok = 0;
break;
end
fid is the file currently being looked at
iBlock is a counter for which 'block' you're in within the file
magic_l and magic_h are to do with checksums later, here is the code for that (follows straight from the code above):
disp(sprintf(' Magic_L = %08X, Magic_H = %08X, Length = %i', magic_l, magic_h, block_length));
correct_magic_l = hex2dec('4D445254');
correct_magic_h = hex2dec('43494741');
if(magic_l ~= correct_magic_l | magic_h ~= correct_magic_h)
disp(['** Bad block start magic !']);
ok = 0;
return;
end
remaining_length = block_length - 3*4 - 3*4; % We read Block Header, and we expect a footer
disp(sprintf(' Remaining Block bytes = %i', remaining_length));
What is going on with the %08X and the hex2dec stuff?
Also, why specify 3*4 instead of 12?
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32'); in Python, as io.readline() is just pulling the first 3 characters of the file. Apologies if I'm missing the point somewhere here. It's just that using io.readline(3) on the file seems to return something it shouldn't, and I don't understand how the block_length can fit in a single byte when it could potentially be very long.
Thanks for reading this ramble. I hope you can understand kind of what I want to know! (Any insight at all is appreciated.)
Python Code for Reading a 1-Dimensional Array
When replacing Matlab with Python, I wanted to read binary data into a numpy.array, so I used numpy.fromfile to read the data into a 1-dimensional array:
import numpy as np
with open(inputfilename, 'rb') as fid:
data_array = np.fromfile(fid, np.int16)
Some advantages of using numpy.fromfile versus other Python solutions include:
Not having to manually determine the number of items to be read. You can specify them using the count= argument, but it defaults to -1 which indicates reading the entire file.
Being able to specify either an open file object (as I did above with fid) or you can specify a filename. I prefer using an open file object, but if you wanted to use a filename, you could replace the two lines above with:
data_array = np.fromfile(inputfilename, np.int16)
Matlab Code for a 2-Dimensional Array
Matlab's fread has the ability to read the data into a matrix of form [m, n] instead of just reading it into a column vector. For instance, to read data into a matrix with 2 rows use:
fid = fopen(inputfilename, 'r');
data_array = fread(fid, [2, inf], 'int16');
fclose(fid);
Equivalent Python Code for a 2-Dimensional Array
You can handle this scenario in Python using Numpy's reshape and transpose.
import numpy as np
with open(inputfilename, 'rb') as fid:
data_array = np.fromfile(fid, np.int16).reshape((-1, 2)).T
The -1 tells numpy.reshape to infer the length of the array for that dimension based on the other dimension—the equivalent of Matlab's inf infinity representation.
The .T transposes the array so that it is a 2-dimensional array with the first dimension—the axis—having a length of 2.
From the documentation of fread, it is a function to read binary data. The second argument specifies the size of the output vector, the third one the size/type of the items read.
In order to recreate this in Python, you can use the array module:
f = open(...)
import array
a = array.array("I") # I is unsigned int, 4 bytes (uint32) on common platforms; check a.itemsize
a.fromfile(f, 3)
This will read three uint32 values from the file f, which are available in a afterwards. From the documentation of fromfile:
Read n items (as machine values) from the file object f and append them to the end of the array. If less than n items are available, EOFError is raised, but the items that were available are still inserted into the array. f must be a real built-in file object; something else with a read() method won’t do.
Arrays implement the sequence protocol and therefore support the same operations as lists, but you can also use the .tolist() method to create a normal list from the array.
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32');
In Matlab, one of fread()'s signatures is fread(fileID, sizeA, precision). This reads in the first sizeA elements (not bytes) of a file, each of a size sufficient for precision. In this case, since you're reading in uint32, each element is of size 32 bits, or 4 bytes.
So, instead, read 12 bytes from a file opened in binary mode, e.g. f.read(12), to get the first three 4-byte elements; readline() is meant for text lines.
The first part is covered by Torsten's answer... you're going to need array or numarray to do anything with this data anyway.
As for the %08X and the hex2dec stuff, %08X is just the print format for those uint32 numbers (8 digit hex, exactly the same as Python), and hex2dec('4D445254') is Matlab for 0x4D445254.
Finally, ~= in Matlab is the "not equal" operator; use != in Python. The | between the two comparisons is a logical OR, which is or in Python.
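Putting the pieces together, the block-start check might be translated roughly as follows. This is only a sketch: the little-endian byte order is an assumption (Matlab's fread uses the machine's native order unless told otherwise), and read_block_start is a hypothetical name.

```python
import io
import struct

CORRECT_MAGIC_L = 0x4D445254   # hex2dec('4D445254')
CORRECT_MAGIC_H = 0x43494741   # hex2dec('43494741')

def read_block_start(f):
    """Read three uint32 values; return (block_length, remaining) or None at EOF."""
    raw = f.read(3 * 4)        # fread(fid, 3, 'uint32') reads 12 bytes
    if len(raw) < 3 * 4:
        return None            # count ~= 3 in the Matlab code
    # Assumption: the file is little-endian; use '>3I' for big-endian data.
    magic_l, magic_h, block_length = struct.unpack('<3I', raw)
    if magic_l != CORRECT_MAGIC_L or magic_h != CORRECT_MAGIC_H:
        raise ValueError('bad block start magic: %08X %08X' % (magic_l, magic_h))
    # Header already read (3*4 bytes), and a footer of 3*4 bytes is expected.
    remaining_length = block_length - 3 * 4 - 3 * 4
    return block_length, remaining_length

# Demo on an in-memory block header with a made-up length of 64 bytes.
good = io.BytesIO(struct.pack('<3I', CORRECT_MAGIC_L, CORRECT_MAGIC_H, 64))
block_length, remaining_length = read_block_start(good)
print(block_length, remaining_length)
```

Writing 3 * 4 rather than 12 simply documents that the header is three 4-byte values, which is probably why the Matlab code does the same.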