Python: Read and write binary data - python

I am aware that there are a lot of almost identical questions, but none seems to really target the general case.
So assume I want to open a file, read it in memory, possibly do some operations on the respective bitstring and write the result back to file.
The following is what seems straightforward to me, but it results in completely different output. Note that for simplicity I only copy the file here:
file = open('INPUT','rb')
data = file.read()
data_16 = data.encode('hex')
data_2 = bin(int(data_16,16))
OUT = open('OUTPUT','wb')
i = 0
while i < len(data_2) / 8:
    byte = int(data_2[i*8 : (i+1)*8], 2)
    OUT.write('%c' % byte)
    i += 1
OUT.close()
I looked at data, data_16 and data_2. The transformations make sense as far as I can see.
As expected, the output file has exactly the same size in bits as the input file.
EDIT: I considered the possibility that the leading '0b' has to be cut. See the following:
>>> data[:100]
'BMFU"\x00\x00\x00\x00\x006\x00\x00\x00(\x00\x00\x00\xe8\x03\x00\x00\xee\x02\x00\x00\x01\x00\x18\x00\x00\x00\x00\x00\x00\x00\x00\x00\x12\x0b\x00\x00\x12\x0b\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x05=o\xce\xf4^\x16\xe0\x80\x92\x00\x00\x00\x01I\x02\x1d\xb5\x81\xcaN\xcb\xb8\x91\xc3\xc6T\xef\xcb\xe1j\x06\xc3;\x0c*\xb9Q\xbc\xff\xf6\xff\xff\xf7\xed\xdf'
>>> data_16[:100]
'424d46552200000000003600000028000000e8030000ee020000010018000000000000000000120b0000120b000000000000'
>>> data_2[:100]
'0b10000100100110101000110010101010010001000000000000000000000000000000000000000000011011000000000000'
>>> data_2[1]
'b'
Maybe the BMFU" part should be cut from data?

>>> bin(25)
'0b11001'
Note two things:
The "0b" at the beginning. This means that your slicing will be off by 2 bits.
The lack of padding to 8 bits. This will corrupt your data every time unless it happens to mesh up with point 1.
Process the file byte by byte instead of attempting to process it in one big gulp like this. If you find your code too slow then you need to find a faster way of working byte by byte, not switch to an irreparably flawed method such as this one.
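Both points can be seen in a minimal Python 3 sketch. The helper names to_bits and from_bits are made up for illustration and are not part of any library. Without fixed-width padding, different inputs collapse to the same bit string. With format(b, "08b") the conversion becomes reversible.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as exactly 8 binary digits (no '0b' prefix)."""
    return "".join(format(b, "08b") for b in data)


def from_bits(bits: str) -> bytes:
    """Inverse of to_bits; expects a length that is a multiple of 8."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


# Unpadded bin() output is ambiguous: two different inputs, one bit string.
unpadded_a = "".join(bin(b)[2:] for b in b"\x01\x01")
unpadded_b = "".join(bin(b)[2:] for b in b"\x03")
print(unpadded_a == unpadded_b)

# Padded to 8 bits per byte, the round trip is lossless.
sample = b"BM\x00\x12\xff"
print(from_bits(to_bits(sample)) == sample)
```

This still works byte by byte, as recommended above; it only makes the width of every byte explicit.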

You could simply write the data variable back out and you'd have a successful round trip.
But it looks like you intend to work on the file as a string of 0 and 1 characters. Nothing wrong with that (though it's rarely necessary), but your code takes a very roundabout way of converting the data to that form. Instead of building a monster integer and converting it to a bit string, just do so for one byte at a time:
data = file.read()
data_2 = "".join( bin(ord(c))[2:].zfill(8) for c in data )
data_2 is now a sequence of zeros and ones. (In a single string, same as you have it; but if you'll be making changes, I'd keep the bitstrings in a list). The reverse conversion is also best done byte by byte:
newdata = "".join(chr(int("".join(byte), 2)) for byte in grouper(long_bitstring, 8, "0"))
This uses the grouper recipe from the itertools documentation.
from itertools import izip_longest
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

You can use the struct module to read and write binary data. (Link to the doc here.)
EDIT
Sorry, I was misled by your title. I’ve just understood that you write binary data in a text file instead of writing binary data directly.

Ok, thanks to alexis and being aware of Ignacio's warning about the padding, I found a way to do what I wanted to do, that is read data into a binary representation and write a binary representation to file:
def padd(bitstring):
    padding = ''
    for i in range(8-len(bitstring)):
        padding += '0'
    bitstring = padding + bitstring
    return bitstring
file = open('INPUT','rb')
data = file.read()
data_2 = "".join( padd(bin(ord(c))[2:]) for c in data )
OUT = open('OUTPUT','wb')
i = 0
while i < len(data_2) / 8:
    byte = int(data_2[i*8 : (i+1)*8], 2)
    OUT.write('%c' % byte)
    i += 1
OUT.close()
If I did not do it exactly the way proposed by alexis then that is because it did not work. Of course this is terribly slow but now that I can do the simplest thing, I can optimize it further.

Related

Python f.read() and Octave fread(). => Reading a binary file showing the same values

I'm reading a binary file with signal samples both in Octave and Python.
The thing is, I want to obtain the same values for both codes, which is not the case.
The binary file is basically a signal in complex I,Q format, recorded as 16-bit ints.
So, based on the Octave code:
[data, cnt_data] = fread(fid, 2 * secondOfData * fs, 'int16');
and then:
data = data(1:2:end) + 1i * data(2:2:end);
It seems simple, just reading the binary data as 16 bits ints. And then creating the final array of complex numbers.
Therefore I assume that in Python I need to do as follows:
rel=int(f.read(2).encode("hex"),16)
img=int(f.read(2).encode("hex"),16)
in_clean.append(complex(rel,img))
Ok, the main problem I have is that both real and imaginary parts values are not the same.
For instance, in Octave, the first value is: -20390 - 10053i
While in Python (applying the code above), the value is: (23216+48088j)
As signs are different, the first thing I thought was that maybe the endianness of the computer that recorded the file and the one I'm using for reading the file are different. So I turned to unpack function, as it allows you to force the endian type.
I was not able to find an "int16" in the unpack documentation:
https://docs.python.org/2/library/struct.html
Therefore I went for the "i" option adding "x" (padding bytes) in order to meet the requirement of 32 bits from the table in the "struct" documentation.
So with:
struct.unpack("i","xx"+f.read(2))[0]
the result is (-1336248200-658802568j) Using
struct.unpack("<i","xx"+f.read(2))[0] provides the same result.
With:
struct.unpack(">i","xx"+f.read(2))[0]
The value is: (2021153456+2021178328j)
With:
struct.unpack(">i",f.read(2)+"xx")[0]
The value is: (1521514616-1143441288j)
With:
struct.unpack("<i",f.read(2)+"xx")[0]
The value is: (2021175386+2021185723j)
I also tried with numpy and "frombuffer":
np.frombuffer(f.read(1).encode("hex"),dtype=np.int16)
Which provides: (24885+12386j)
So, any idea about what I'm doing wrong? I'd like to obtain the same value as in Octave.
What is the proper way of reading and interpreting the values in Python so that I obtain the same values as Octave's fread with 'int16'?
I've been searching on the Internet for an answer for this but I was not able to find a method that provides the same value
Thanks a lot
Best regards
It looks like the binary data in your question is 5ab0bbd8. To unpack signed 16 bit integers with struct.unpack, you use the 'h' format character. From that (23216+48088j) output, it appears that the data is encoded as little-endian, so we need to use < as the first item in the format string.
from struct import unpack
data = b'\x5a\xb0\xbb\xd8'
# The wrong way
rel=int(data[:2].encode("hex"),16)
img=int(data[2:].encode("hex"),16)
c = complex(rel, img)
print c
# The right way
rel, img = unpack('<hh', data)
c = complex(rel, img)
print c
output
(23216+48088j)
(-20390-10053j)
Note that rel, img = unpack('<hh', data) will also work correctly on Python 3.
FWIW, in Python 3, you could also decode 2 bytes to a signed integer like this:
def int16_bytes_to_int(b):
    n = int.from_bytes(b, 'little')
    if n > 0x7fff:
        n -= 0x10000
    return n
The rough equivalent in Python 2 is:
def int16_bytes_to_int(b):
    lo, hi = b
    n = (ord(hi) << 8) + ord(lo)
    if n > 0x7fff:
        n -= 0x10000
    return n
But having to do that subtraction to handle signed numbers is annoying, and using struct.unpack is bound to be much more efficient.
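For a whole file of samples, NumPy can do the same decoding in one step. The following is only a sketch, assuming the file holds interleaved little-endian int16 I/Q pairs as in the question; the function name read_iq_int16 is made up for illustration.

```python
import numpy as np


def read_iq_int16(raw: bytes) -> np.ndarray:
    """Interpret raw bytes as interleaved little-endian int16 I/Q pairs."""
    # '<i2' = little-endian signed 16-bit; len(raw) must be a multiple of 2.
    samples = np.frombuffer(raw, dtype="<i2")
    usable = samples[: len(samples) // 2 * 2]   # drop a dangling unpaired sample
    # Same idea as Octave's data(1:2:end) + 1i * data(2:2:end)
    return usable[0::2] + 1j * usable[1::2]


iq = read_iq_int16(b"\x5a\xb0\xbb\xd8")
print(iq)
```

With a real file you would pass the bytes from f.read(), or use np.fromfile with the same dtype.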

Python3 reading a binary file, 4 bytes at a time and xor it with a 4 byte long key

I want to read a binary file, get the content four bytes by four bytes and perform int operations on these packets.
Using a dummy binary file, opened this way:
with open('MEM_10001000_0000B000.mem', 'br') as f:
    for byte in f.read():
        print (hex(byte))
I want to perform an encryption with a 4 byte long key, 0x9485A347 for example.
Is there a simple way I can read my files 4 bytes at a time and get them as int or do I need to put them in a temporary result using a counter?
My original idea is the following:
current_tmp = []
for byte in data:
    current_tmp.append(int(byte))
    if (len(current_tmp) == 4):
        print (current_tmp)
        # but current_tmp is an array not a single int
        current_tmp = []
In my example, instead of having [132, 4, 240, 215] I would rather have 0x8404f0d7
Just use the size argument of read to read 4 bytes at a time, and the from_bytes constructor of Python 3's int to convert them:
with open('MEM_10001000_0000B000.mem', 'br') as f:
    data = f.read(4)
    while data:
        number = int.from_bytes(data, "big")
        ...
        data = f.read(4)
If you are not using Python 3 yet for some reason, int won't feature a from_bytes method - then you could resort to use the struct module:
import struct
...
number = struct.unpack(">i", data)[0]
...
These methods, however, are fine for a few iterations but could get slow for a large file. Python offers a way to fill an array of 4-byte integers directly in memory from an open file, which is more likely what you should be using:
import array, os
numbers = array.array("i")
with open('MEM_10001000_0000B000.mem', 'br') as f:
    numbers.fromfile(f, os.stat('MEM_10001000_0000B000.mem').st_size // numbers.itemsize)
numbers.byteswap()
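This gives you a numbers sequence with all the numbers in your file read in as 4-byte, big-endian, signed ints (the byteswap() call assumes the file is big-endian and the machine is little-endian).
Once you have the array, you can XOR its elements together, starting from the key, with something like:
from functools import reduce #not needed in Python2.7
result = reduce(lambda result, input: result ^ input, numbers, key)
Note that this folds the whole array into a single value; to encrypt the data, XOR each number with the key instead.
If your file is not a multiple of 4 bytes, the first two methods might need some adjustment - fixing the while condition will be enough.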
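To apply the key to every 4-byte word rather than reducing the file to one number, something like the following Python 3 sketch could work. The function name xor_words and the handling of a short final chunk (XORed with the leading bytes of the key) are assumptions, not something the question specifies.

```python
def xor_words(data: bytes, key: int) -> bytes:
    """XOR each big-endian 4-byte word of data with a 32-bit key."""
    key_bytes = key.to_bytes(4, "big")
    out = bytearray()
    for offset in range(0, len(data), 4):
        chunk = data[offset:offset + 4]
        # A short final chunk only uses the leading bytes of the key.
        partial_key = int.from_bytes(key_bytes[:len(chunk)], "big")
        word = int.from_bytes(chunk, "big") ^ partial_key
        out += word.to_bytes(len(chunk), "big")
    return bytes(out)


encrypted = xor_words(bytes([132, 4, 240, 215]), 0x9485A347)
print(encrypted.hex())
```

Because XOR is its own inverse, calling xor_words again with the same key restores the original bytes.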

simple reading of fortran binary data not so simple in python

I have a binary output file from a FORTRAN code. Want to read it in python. (Reading with FORTRAN and outputting text to read for python is not an option. Long story.) I can read the first record in a simplistic manner:
>>> binfile=open('myfile','rb')
>>> pad1=struct.unpack('i',binfile.read(4))[0]
>>> ver=struct.unpack('d',binfile.read(8))[0]
>>> pad2=struct.unpack('i',binfile.read(4))[0]
>>> pad1,ver,pad2
(8,3.13,8)
Just fine. But this is a big file and I need to do this more efficiently. So I try:
>>> (pad1,ver,pad2)=struct.unpack('idi',binfile.read(16))
This won't run. Gives me an error and tells me that unpack needs an argument with a length of 20. This makes no sense to me since the last time I checked, 4+8+4=16. When I give in and replace the 16 with 20, it runs, but the three numbers are populated with numerical junk. Does anyone see what I am doing wrong? Thanks!
The size you get is due to alignment, try struct.calcsize('idi') to verify the size is actually 20 after alignment. To use the native byte-order without alignment, specify struct.calcsize('=idi') and adapt it to your example.
For more info on the struct module, check http://docs.python.org/2/library/struct.html
The struct module is mainly intended to interoperate with C structures and because of this it aligns the data members. idi corresponds to the following C structure:
struct
{
    int int1;
    double double1;
    int int2;
}
double entries require 8 byte alignment in order to function efficiently (or even correctly) with most CPU load operations. That's why 4 bytes of padding are being added between int1 and double1, which increases the size of the structure to 20 bytes. The same padding is performed by the struct module, unless you suppress the padding by adding < (on little endian machines) or > (on big endian machines), or simply = at the beginning of the format string:
>>> struct.unpack('idi', d)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
struct.error: unpack requires a string argument of length 20
>>> struct.unpack('<idi', d)
(-1345385859, 2038.0682530887993, 428226400)
>>> struct.unpack('=idi', d)
(-1345385859, 2038.0682530887993, 428226400)
(d is a string of 16 random chars.)
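The size difference can be checked directly. This is a small sketch that builds a 16-byte record with the values from the question (8, 3.13, 8) instead of reading the real file, which is an assumption made so the example is self-contained.

```python
import struct

# '=' means native byte order, standard sizes, no alignment padding.
print(struct.calcsize("=idi"))   # 4 + 8 + 4 bytes
# Plain 'idi' uses native alignment, so its size is platform dependent.
print(struct.calcsize("idi"))

# Build a 16-byte record like the one in the question and read it back.
record = struct.pack("=idi", 8, 3.13, 8)
pad1, ver, pad2 = struct.unpack("=idi", record)
print(pad1, ver, pad2)
```

With the real file, the call would be struct.unpack('=idi', binfile.read(16)).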
I recommend using arrays to read a file that was written by FORTRAN with UNFORMATTED, SEQUENTIAL.
Your specific example using arrays, would be as follows:
import array
binfile=open('myfile','rb')
pad = array.array('i')
ver = array.array('d')
pad.fromfile(binfile,1) # read the length of the record
ver.fromfile(binfile,1) # read the actual data written by FORTRAN
pad.fromfile(binfile,1) # read the length of the record
If you have FORTRAN records that write arrays of integers and doubles, which is very common, your python would look something like this:
import array
binfile=open('myfile','rb')
pad = array.array('i')
my_integers = array.array('i')
my_floats = array.array('d')
number_of_integers = 1000 # replace with how many you need to read
number_of_floats = 10000 # replace with how many you need to read
pad.fromfile(binfile,1) # read the length of the record
my_integers.fromfile(binfile,number_of_integers) # read the integer data
my_floats.fromfile(binfile,number_of_floats) # read the double data
pad.fromfile(binfile,1) # read the length of the record
Final comment is that if you have characters on the file, you can read those into an array as well, and then decode it into a string. Something like this:
import array
binfile=open('myfile','rb')
pad = array.array('i')
my_characters = array.array('B')
number_of_characters = 63 # replace with number of characters to read
pad.fromfile(binfile,1) # read the length of the record
my_characters.fromfile(binfile,number_of_characters ) # read the data
my_string = my_characters.tobytes().decode(encoding='utf_8')
pad.fromfile(binfile,1) # read the length of the record
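Since every record is wrapped in two length markers, it can help to check them while reading. The sketch below assumes 4-byte little-endian markers, which is common but depends on the compiler and its options; the function name read_fortran_record is made up for illustration.

```python
import io
import struct


def read_fortran_record(stream) -> bytes:
    """Read one unformatted sequential record and return its payload bytes."""
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("no more records")
    (length,) = struct.unpack("<i", header)   # assumed marker format
    payload = stream.read(length)
    trailer = stream.read(4)
    # The trailing marker must repeat the leading one byte for byte.
    if len(payload) != length or trailer != header:
        raise ValueError("truncated record or mismatched record markers")
    return payload


# Stand-in for the real file: one record holding a single double.
fake_file = io.BytesIO(struct.pack("<idi", 8, 3.13, 8))
(ver,) = struct.unpack("<d", read_fortran_record(fake_file))
print(ver)
```

The returned payload can then be decoded with struct or loaded into an array with frombytes.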

Getting Raw Binary Representation of a file in Python

I'd like to get the exact sequence of bits from a file into a string using Python 3. There are several questions on this topic which come close, but don't quite answer it. So far, I have this:
>>> data = open('file.bin', 'rb').read()
>>> data
'\xa1\xa7\xda4\x86G\xa0!e\xab7M\xce\xd4\xf9\x0e\x99\xce\xe94Y3\x1d\xb7\xa3d\xf9\x92\xd9\xa8\xca\x05\x0f$\xb3\xcd*\xbfT\xbb\x8d\x801\xfanX\x1e\xb4^\xa7l\xe3=\xaf\x89\x86\xaf\x0e8\xeeL\xcd|*5\xf16\xe4\xf6a\xf5\xc4\xf5\xb0\xfc;\xf3\xb5\xb3/\x9a5\xee+\xc5^\xf5\xfe\xaf]\xf7.X\x81\xf3\x14\xe9\x9fK\xf6d\xefK\x8e\xff\x00\x9a>\xe7\xea\xc8\x1b\xc1\x8c\xff\x00D>\xb8\xff\x00\x9c9...'
>>> bin(data[:][0])
'0b11111111'
OK, I can get a base-2 number, but I don't understand why data[:][x], and I still have the leading 0b. It would also seem that I have to loop through the whole string and do some casting and parsing to get the correct output. Is there a simpler way to just get the sequence of 01's without looping, parsing, and concatenating strings?
Thanks in advance!
I would first precompute the string representation for all values 0..255
bytetable = [("00000000"+bin(x)[2:])[-8:] for x in range(256)]
or, if you prefer bits in LSB to MSB order
bytetable = [("00000000"+bin(x)[2:])[-1:-9:-1] for x in range(256)]
then the whole file in binary can be obtained with
binrep = "".join(bytetable[x] for x in open("file", "rb").read())
If you are OK using an external module, this uses bitstring:
>>> import bitstring
>>> bitstring.BitArray(filename='file.bin').bin
'110000101010000111000010101001111100...'
and that's it. It just makes the binary string representation of the whole file.
It is not quite clear what the sequence of bits is meant to be. I think it would be most natural to start at byte 0 with bit 0, but it actually depends on what you want.
So here is some code to access the sequence of bits starting with bit 0 in byte 0:
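def bits_from_char(c):
    # In Python 3, iterating over bytes yields ints; in Python 2 it yields 1-char strings.
    i = c if isinstance(c, int) else ord(c)
    for dummy in range(8):
        yield i & 1
        i >>= 1
def bits_from_data(data):
    for c in data:
        for bit in bits_from_char(c):
            yield bit
for bit in bits_from_data(data):
    pass # process bit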
(Another note: you would not need data[:][0] in your code. Simply data[0] would do the trick, but without copying the whole string first.)
To convert raw binary data such as b'\xa1\xa7\xda4\x86' into a bitstring that represents the data as a number in binary system (base-2) in Python 3:
>>> data = open('file.bin', 'rb').read()
>>> bin(int.from_bytes(data, 'big'))[2:]
'1010000110100111110110100011010010000110...'
See Convert binary to ASCII and vice versa.
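One caveat with the int.from_bytes approach is that leading zero bits are dropped, so a file starting with a small byte value produces a shorter string. A possible fix, sketched here with a made-up helper name bit_string, is to pad the result to 8 digits per byte:

```python
def bit_string(data: bytes) -> str:
    """Return data as '0'/'1' characters, 8 per byte, most significant bit first."""
    if not data:
        return ""
    return bin(int.from_bytes(data, "big"))[2:].zfill(8 * len(data))


bits = bit_string(b"\x00\x01")
print(bits)

# Going back: the bit count tells us how many bytes to produce.
restored = int(bits, 2).to_bytes(len(bits) // 8, "big")
print(restored)
```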

What is the equivalent of 'fread' from Matlab in Python?

I have practically no knowledge of Matlab, and need to translate some parsing routines into Python. They are for large files, that are themselves divided into 'blocks', and I'm having difficulty right from the off with the checksum at the top of the file.
What exactly is going on here in Matlab?
status = fseek(fid, 0, 'cof');
fposition = ftell(fid);
disp(' ');
disp(['** Block ',num2str(iBlock),' File Position = ',int2str(fposition)]);
% ----------------- Block Start ------------------ %
[A, count] = fread(fid, 3, 'uint32');
if(count == 3)
    magic_l = A(1);
    magic_h = A(2);
    block_length = A(3);
else
    if(fposition == file_length)
        disp(['** End of file OK']);
    else
        disp(['** Cannot read block start magic ! Note File Length = ',num2str(file_length)]);
    end
    ok = 0;
    break;
end
fid is the file currently being looked at
iBlock is a counter for which 'block' you're in within the file
magic_l and magic_h are to do with checksums later, here is the code for that (follows straight from the code above):
disp(sprintf(' Magic_L = %08X, Magic_H = %08X, Length = %i', magic_l, magic_h, block_length));
correct_magic_l = hex2dec('4D445254');
correct_magic_h = hex2dec('43494741');
if(magic_l ~= correct_magic_l | magic_h ~= correct_magic_h)
    disp(['** Bad block start magic !']);
    ok = 0;
    return;
end
remaining_length = block_length - 3*4 - 3*4; % We read Block Header, and we expect a footer
disp(sprintf(' Remaining Block bytes = %i', remaining_length));
What is going on with the %08X and the hex2dec stuff?
Also, why specify 3*4 instead of 12?
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32'); in Python, as io.readline() is just pulling the first 3 characters of the file. Apologies if I'm missing the point somewhere here. It's just that using io.readline(3) on the file seems to return something it shouldn't, and I don't understand how the block_length can fit in a single byte when it could potentially be very long.
Thanks for reading this ramble. I hope you can understand kind of what I want to know! (Any insight at all is appreciated.)
Python Code for Reading a 1-Dimensional Array
When replacing Matlab with Python, I wanted to read binary data into a numpy.array, so I used numpy.fromfile to read the data into a 1-dimensional array:
import numpy as np
with open(inputfilename, 'rb') as fid:
    data_array = np.fromfile(fid, np.int16)
Some advantages of using numpy.fromfile versus other Python solutions include:
Not having to manually determine the number of items to be read. You can specify them using the count= argument, but it defaults to -1 which indicates reading the entire file.
Being able to specify either an open file object (as I did above with fid) or you can specify a filename. I prefer using an open file object, but if you wanted to use a filename, you could replace the two lines above with:
data_array = np.fromfile(inputfilename, np.int16)
Matlab Code for a 2-Dimensional Array
Matlab's fread has the ability to read the data into a matrix of form [m, n] instead of just reading it into a column vector. For instance, to read data into a matrix with 2 rows use:
fid = fopen(inputfilename, 'r');
data_array = fread(fid, [2, inf], 'int16');
fclose(fid);
Equivalent Python Code for a 2-Dimensional Array
You can handle this scenario in Python using Numpy's reshape and transpose.
import numpy as np
with open(inputfilename, 'rb') as fid:
    data_array = np.fromfile(fid, np.int16).reshape((-1, 2)).T
The -1 tells numpy.reshape to infer the length of the array for that dimension based on the other dimension—the equivalent of Matlab's inf infinity representation.
The .T transposes the array so that it is 2-dimensional with the first axis having a length of 2.
From the documentation of fread, it is a function to read binary data. The second argument specifies the size of the output vector, the third one the size/type of the items read.
In order to recreate this in Python, you can use the array module:
f = open(...)
import array
a = array.array("I") # "I" is unsigned int, 4 bytes on most platforms (check a.itemsize); "L" can be 8 bytes
a.fromfile(f, 3)
This will read three uint32 values from the file f, which are available in a afterwards. From the documentation of fromfile:
Read n items (as machine values) from the file object f and append them to the end of the array. If less than n items are available, EOFError is raised, but the items that were available are still inserted into the array. f must be a real built-in file object; something else with a read() method won’t do.
Arrays implement the sequence protocol and therefore support the same operations as lists, but you can also use the .tolist() method to create a normal list from the array.
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32');
In Matlab, one of fread()'s signatures is fread(fileID, sizeA, precision). This reads in the first sizeA elements (not bytes) of a file, each of a size sufficient for precision. In this case, since you're reading in uint32, each element is of size 32 bits, or 4 bytes.
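So, instead, open the file in binary mode and use f.read(12) to get the first three 4-byte elements, then decode them (for example with struct.unpack); readline() is meant for text lines, not fixed-size binary fields.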
The first part is covered by Torsten's answer... you're going to need array or numarray to do anything with this data anyway.
As for the %08X and the hex2dec stuff, %08X is just the print format for those uint32 numbers (8 digit hex, exactly the same as Python), and hex2dec('4D445254') is matlab for 0x4D445254.
Finally, ~= in matlab is the "not equal" operator; use != in Python.
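Putting these pieces together, the block-header check from the question might look like this in Python. This is only a sketch: it assumes the file stores the three uint32 values in little-endian order (Matlab's fread uses the machine's native order unless told otherwise), and the function name read_block_header is made up for illustration.

```python
import io
import struct

MAGIC_L = 0x4D445254  # same value as hex2dec('4D445254')
MAGIC_H = 0x43494741  # same value as hex2dec('43494741')


def read_block_header(stream) -> int:
    """Read three uint32 values and return the remaining block length in bytes."""
    raw = stream.read(12)                 # 3 values * 4 bytes each
    if len(raw) != 12:
        raise EOFError("could not read a complete block header")
    magic_l, magic_h, block_length = struct.unpack("<3I", raw)
    print("Magic_L = %08X, Magic_H = %08X, Length = %i"
          % (magic_l, magic_h, block_length))
    if magic_l != MAGIC_L or magic_h != MAGIC_H:
        raise ValueError("bad block start magic")
    return block_length - 3 * 4 - 3 * 4   # header read, footer still expected


# Stand-in for the real file: a header announcing a 100-byte block.
fake_file = io.BytesIO(struct.pack("<3I", MAGIC_L, MAGIC_H, 100))
remaining = read_block_header(fake_file)
print(remaining)
```

Writing 3 * 4 rather than 12 just documents that the header and footer are each three 4-byte values.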
