Python Struct to Unpack like C fread using C structures?

I am struggling to port a snippet of code from C (originally Fortran) to python to unpack the header of a binary file.
The C is:
fread(&hdr, hdrSize, 1, modelFile);
alb = hdr.fd[0];
skrc = hdr.fd[2];
taud = hdr.fd[16];
djul = hdr.fd[40];
deljul = hdr.fd[41];
n4 = hdr.id[3];
n5 = hdr.id[4];
n24 = hdr.id[5];
jdisk = hdr.id[11];
k4out = hdr.id[16];
j5 = hdr.id[39];
The file has been written using Fortran where, for example, alb is defined as (F10.2) and n4 is defined as (I10).
I tried the struct module, numpy.fromfile(), and numpy.fromstring(). I know the total header size, so using numpy.fromstring() on the binary data (opened with open(file, 'r+b')) I can generate an array, but it is significantly longer than expected. It is backfilled with 0.0, which I believe makes sense if the header is a fixed size and the actual data is smaller.
Is the struct module the way to go to access the data in the same manner that fread() does? In the C code, I believe that .fd[0] refers to element 0 of the fd array inside the structure? If so, can that layout be duplicated in Python?
Edit:
I am not able to find .id or .fd in the C code, but the original Fortran defines them as
REAL*4 FD(80)
INTEGER*4 ID(40)
The rest of the port from Fortran to C is essentially verbatim, so I would assume that the declarations are identical.
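Given those declarations, the whole header should map onto a numpy structured dtype, which would give hdr['fd'] / hdr['id'] access much like the C struct members. A minimal sketch (assuming little-endian data, with 'model.bin' standing in for the real filename):
import numpy as np

# REAL*4 FD(80) + INTEGER*4 ID(40) -> one 480-byte header record
hdr_dtype = np.dtype([('fd', '<f4', (80,)),
                      ('id', '<i4', (40,))])

hdr = np.fromfile('model.bin', dtype=hdr_dtype, count=1)[0]
alb = hdr['fd'][0]   # the element the C code reads as hdr.fd[0]
n4 = hdr['id'][3]    # likewise hdr.id[3]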
Edit2:
The following appears to be working, in that the variable values look plausible. This uses the struct module.
with open(sys.argv[1], 'r+b') as l:
    headdata = l.read(HEAD_BYTES)

fd_arr = struct.unpack('f' * 80, headdata[0:320])
print(fd_arr[0])
print(fd_arr[2])
print(fd_arr[16])
print(fd_arr[40])
print(fd_arr[41])

id_arr = struct.unpack('i' * 40, headdata[320:480])
print(id_arr[3])
print(id_arr[4])
print(id_arr[5])
print(id_arr[11])
print(id_arr[16])
print(id_arr[39])
I will explore using a single format string to see the impact of mixed types within the string.
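For instance, something like this (a sketch; the '<' prefix pins little-endian byte order and disables native padding, which matches the separate unpacks above):
import struct

# 80 floats followed by 40 ints, unpacked in a single call
values = struct.unpack('<80f40i', headdata[:480])
fd_arr, id_arr = values[:80], values[80:]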

Related

Python - Efficient way to flip bytes in a file?

I've got a folder full of very large files that need to be byte flipped by a power of 4. So essentially, I need to read the files as a binary, adjust the sequence of bits, and then write a new binary file with the bits adjusted.
In essence, what I'm trying to do is read a hex string hexString that looks like this:
"00112233AABBCCDD"
And write a file that looks like this:
"33221100DDCCBBAA"
(i.e. every two characters is a byte, and I need to flip the bytes by a power of 4)
I am very new to python and coding in general, and the way I am currently accomplishing this task is extremely inefficient. My code currently looks like this:
import binascii

with open(myFile, 'rb') as f:
    content = f.read()
hexString = str(binascii.hexlify(content))

flippedBytes = ""
inc = 0
while inc < len(hexString):
    flippedBytes += hexString[inc + 6:inc + 8]
    flippedBytes += hexString[inc + 4:inc + 6]
    flippedBytes += hexString[inc + 2:inc + 4]
    flippedBytes += hexString[inc:inc + 2]
    inc += 8
..... write the flippedBytes to file, etc
The code I pasted above accurately accomplishes what I need (note, my actual code has a few extra lines of: "hexString.replace()" to remove unnecessary hex characters - but I've left those out to make the above easier to read). My ultimate problem is that it takes EXTREMELY long to run my code with larger files. Some of my files I need to flip are almost 2gb in size, and the code was going to take almost half a day to complete one single file. I've got dozens of files I need to run this on, so that timeframe simply isn't practical.
Is there a more efficient way to flip the HEX values in a file by a power of 4?
.... for what it's worth, there is a tool called WinHEX that can do this manually, and only takes a minute max to flip the whole file.... I was just hoping to automate this with python so we didn't have to manually use WinHEX each time
You want to convert your 4-byte integers from little-endian to big-endian, or vice-versa. You can use the struct module for that:
import struct

with open(myfile, 'rb') as infile, open(myoutput, 'wb') as of:
    while True:
        d = infile.read(4)
        if not d:
            break
        le = struct.unpack('<I', d)
        be = struct.pack('>I', *le)
        of.write(be)
Here is a little struct awesomeness to get you started:
>>> import struct
>>> s = b'\x00\x11\x22\x33\xAA\xBB\xCC\xDD'
>>> a, b = struct.unpack('<II', s)
>>> s = struct.pack('>II', a, b)
>>> ''.join([format(x, '02x') for x in s])
'33221100ddccbbaa'
To do this at full speed for a large input, use struct.iter_unpack.
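A sketch of what that could look like (it assumes the input length is a multiple of 4; the filenames are placeholders):
import struct

with open('input.bin', 'rb') as infile, open('output.bin', 'wb') as outfile:
    data = infile.read()
    # iter_unpack walks the buffer one 4-byte word at a time without
    # re-parsing the format string on every call
    swapped = b''.join(struct.pack('>I', v)
                       for (v,) in struct.iter_unpack('<I', data))
    outfile.write(swapped)
For multi-gigabyte files, numpy's ndarray.byteswap performs the same 4-byte swap as a single vectorized pass and is typically much faster still.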

Reading A Binary File In Fortran That Was Created By A Python Code

I have a binary file that was created using a Python code. This code mainly scripts a bunch of tasks to pre-process a set of data files. I would now like to read this binary file in Fortran. The content of the binary file is coordinates of points in a simple format e.g.: number of points, x0, y0, z0, x1, y1, z1, ....
These binary files were created using the 'tofile' function in numpy. I have the following code in Fortran so far:
integer :: intValue
double precision :: dblValue
integer :: counter
integer :: check

open(unit=10, file='file.bin', form='unformatted', status='old', access='stream')

counter = 1
do
    if ( counter == 1 ) then
        read(unit=10, iostat=check) intValue
        if ( check < 0 ) then
            print *, "End Of File"
            stop
        else if ( check > 0 ) then
            print *, "Error Detected"
            stop
        else if ( check == 0 ) then
            counter = counter + 1
            print *, intValue
        end if
    else if ( counter > 1 ) then
        read(unit=10, iostat=check) dblValue
        if ( check < 0 ) then
            print *, "End Of File"
            stop
        else if ( check > 0 ) then
            print *, "Error Detected"
            stop
        else if ( check == 0 ) then
            counter = counter + 1
            print *, dblValue
        end if
    end if
end do
close(unit=10)
This unfortunately does not work, and I get garbage numbers (e.g. 6.4731191026611484E+212, 2.2844499004808491E-279, etc.). Could someone give some pointers on how to do this correctly?
Also what would be a good way of writing and reading binary files interchangeably between Python and Fortran - as it seems like that is going to be one of the requirements of my application.
Thanks
Here's a trivial example of how to take data generated with numpy to Fortran the binary way.
I calculated 360 values of sin on [0,2π),
#!/usr/bin/env python3
import numpy as np

with open('sin.dat', 'wb') as outfile:
    np.sin(np.arange(0., 2*np.pi, np.pi/180.,
                     dtype=np.float32)).tofile(outfile)
exported that with tofile to the binary file 'sin.dat', which has a size of 1440 bytes (360 * sizeof(float32)), and read that file with the following Fortran 95 program (compiled with gfortran -O3 -Wall -pedantic), which outputs 1. - (val**2 + cos(x)**2) for x in [0,2π):
program numpy_import
    integer, parameter :: REAL_KIND = 4
    integer, parameter :: UNIT = 10
    integer, parameter :: SAMPLE_LENGTH = 360

    real(REAL_KIND), parameter :: PI = acos(-1.)
    real(REAL_KIND), parameter :: DPHI = PI/180.
    real(REAL_KIND), dimension(0:SAMPLE_LENGTH-1) :: arr
    real(REAL_KIND) :: r
    integer :: i

    open(UNIT, file="sin.dat", form='unformatted',&
         access='direct', recl=4)
    do i = 0, ubound(arr, 1)
        read(UNIT, rec=i+1, err=100) arr(i)
    end do

    do i = 0, ubound(arr, 1)
        r = 1. - (arr(i)**2 + cos(real(i*DPHI, REAL_KIND))**2)
        write(*, '(F6.4, " ")', advance='no')&
            real(int(r*1E6+1)/1E6, REAL_KIND)
    end do

100 close(UNIT)
    write(*,*)
end program numpy_import
thus, if val == sin(x), the numeric result must vanish to a good approximation for float32 types.
And indeed, the output is:
360 x 0.0000
So thanks to this great community, all the advice I got, and a little bit of tinkering around, I think I figured out a stable solution to this problem, and I wanted to share this answer with you all. I will provide a minimal example here, where I want to write a variable-size array from Python into a binary file and read it using Fortran. I am assuming that the number of rows numRows and the number of columns numCols are written along with the full array dataArray. The following Python script writeBin.py writes the file:
import numpy as np

# Read in the numRows and numCols values
# Read in the array values

numRowArr = np.array([numRows], dtype=np.float32)
numColArr = np.array([numCols], dtype=np.float32)

fileObj = open('pybin.bin', 'wb')
numRowArr.tofile(fileObj)
numColArr.tofile(fileObj)
for i in range(numRows):
    lineArr = dataArray[i, :]
    lineArr.tofile(fileObj)
fileObj.close()
Following this, the Fortran code to read the array from the file can be programmed as follows:
program readBin
    use iso_fortran_env
    implicit none

    integer :: nR, nC, i
    real(kind=real32) :: numRowVal, numColVal
    real(kind=real32), dimension(:), allocatable :: rowData
    real(kind=real32), dimension(:,:), allocatable :: fullData

    open(unit=10, file='pybin.bin', form='unformatted', status='old', access='stream')

    read(unit=10) numRowVal
    nR = int(numRowVal)
    read(unit=10) numColVal
    nC = int(numColVal)

    allocate(rowData(nC))
    allocate(fullData(nR, nC))

    do i = 1, nR
        read(unit=10) rowData
        fullData(i, :) = rowData(:)
    end do

    close(unit=10)
end program readBin
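For completeness, here is a small Python sketch (not part of the answer above) to read pybin.bin back and check the layout; it assumes dataArray holds float32 values, matching the real32 reads in the Fortran:
import numpy as np

# Read back pybin.bin to verify the layout written by writeBin.py
with open('pybin.bin', 'rb') as f:
    nR = int(np.fromfile(f, dtype=np.float32, count=1)[0])
    nC = int(np.fromfile(f, dtype=np.float32, count=1)[0])
    fullData = np.fromfile(f, dtype=np.float32, count=nR * nC).reshape(nR, nC)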
The main point that I gathered from the discussion on this thread is to match the read and the write as closely as possible, with precise specifications of the data types to be read, the way they are written, etc. As you may note, this is a made-up example, so there may be some things here and there that are not perfect. However, I have since used this in a finite element program, where the mesh data was read and written this way, and it worked very well.
P.S.: In case you find a typo, please let me know, and I will edit it right away.
Thanks a lot.

Huge performance difference between Python and Fortran for a small program

I am reading a file containing single-precision data with 512**3 data points. Based on a threshold, I assign each point a flag of 1 or 0. I wrote two programs doing the same thing, one in Fortran, the other in Python. But the one in Fortran takes about 0.1 s while the one in Python takes minutes. Is that normal? Or can you please point out the problem with my Python program:
fortran.f
program vorticity_tracking
    implicit none
    integer, parameter :: length = 512**3
    real, parameter :: threshold = 1320.0  ! must be real, not integer, to hold 1320.0
    character(255) :: filen
    real, dimension(length) :: stored_data
    integer, dimension(length) :: flag
    integer :: index

    filen = "vor.dat"
    print *, "Reading the file ", trim(filen)
    open(10, file=trim(filen), form="unformatted", &
         access="direct", recl=length*4)
    read (10, rec=1) stored_data
    close(10)

    do index = 1, length
        if (stored_data(index) .ge. threshold) then
            flag(index) = 1
        else
            flag(index) = 0
        end if
    end do

    stop
end program
Python file:
#!/usr/bin/env python
import struct
import numpy as np

f_type = 'float32'
length = 512**3
threshold = 1320.0
file = 'vor_00000_455.float'

f = open(file, 'rb')
data = np.fromfile(f, dtype=f_type, count=-1)
f.close()

flag = []
for index in range(length):
    if data[index] >= threshold:
        flag.append(1)
    else:
        flag.append(0)
Edit:
Thanks for your comments. I am not sure then how to do this in Python. I tried the following, but it is still just as slow.
flag = np.ndarray(length, dtype=bool)  # np.bool is deprecated; use plain bool
for index in range(length):
    if data[index] >= threshold:
        flag[index] = 1
    else:
        flag[index] = 0
Can anyone please show me?
Your two programs are totally different. Your Python code repeatedly changes the size of a structure. Your Fortran code does not. You're not comparing two languages, you're comparing two algorithms and one of them is obviously inferior.
In general Python is an interpreted language while Fortran is a compiled one. Therefore you have some overhead in Python. But it shouldn't take that long.
One thing that can be improved in the Python version is to replace the for loop with an indexing operation.
# create flag filled with zeros, with the same shape as data
flag = numpy.zeros(data.shape)
# get a bool array stating where data >= threshold
barray = data >= threshold
# everywhere barray == True, put a 1 in flag
flag[barray] = 1
Shorter version:
# create flag filled with zeros, with the same shape as data
flag = numpy.zeros(data.shape)
# combine the two operations without the temporary barray
flag[data >= threshold] = 1
Try this for the Python version:
flag = data > threshold
It will give you an array of flags as you want.
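If integer 0/1 flags are needed, as in the Fortran version, the boolean result can be cast in one vectorized step. A sketch reusing the question's filename:
import numpy as np

threshold = 1320.0
data = np.fromfile('vor_00000_455.float', dtype=np.float32)
# vectorized comparison, then a cheap cast from bool to 0/1 integers
flag = (data >= threshold).astype(np.int8)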

byte reverse AB CD to CD AB with python

I have a .bin file, and I want to simply byte-reverse the hex data. Say, for instance, at 0x10 it reads AD DE DE C0, and I want it to read DE AD C0 DE.
I know there is a simple way to do this, but I am a beginner just learning Python, and I am trying to make a few simple programs to help me through my daily tasks. I would like to convert the whole file this way, not just 0x10.
I will be converting at start offset 0x000000, and the blocksize/length is 1000000.
Here is my code; maybe you can tell me what to do. I am sure I am just not getting it, and I am new to programming and Python. If you could help me, I would very much appreciate it.
def main():
    infile = open("file.bin", "rb")
    new_pos = int("0x000000", 16)
    chunk = int("1000000", 16)
    data = infile.read(chunk)
    reverse(data)

def reverse(data):
    output(data)

def output(data):
    with open("reversed", "wb") as outfile:
        outfile.write(data)

main()
You can see the function meant for reversing; I have tried many different suggestions, and it will either pass the file through untouched or throw errors. I know reverse() is empty now, but I have tried all kinds of things. I just need reverse() to convert AB CD to CD AB.
Thanks for any input.
EDIT: the file is 16 MB, and I want to reverse the byte order of the whole file.
In Python 3.4 you can use this:
>>> data = b'\xAD\xDE\xDE\xC0'
>>> swap_data = bytearray(data)
>>> swap_data.reverse()
the result is
bytearray(b'\xc0\xde\xde\xad')
In Python 2, the binary file gets read as a string, so string slicing should easily handle the swapping of adjacent bytes:
>>> original = '\xAD\xDE\xDE\xC0'
>>> ''.join([c for t in zip(original[1::2], original[::2]) for c in t])
'\xde\xad\xc0\xde'
In Python 3, the binary file gets read as bytes. Only a small modification is needed to build another array of bytes:
>>> original = b'\xAD\xDE\xDE\xC0'
>>> bytes([c for t in zip(original[1::2], original[::2]) for c in t])
b'\xde\xad\xc0\xde'
You could also use the < and > endianness format codes in the struct module to achieve the same result:
>>> struct.pack('<2h', *struct.unpack('>2h', original))
'\xde\xad\xc0\xde'
Happy byte swapping :-)
data = b'\xAD\xDE\xDE\xC0'
reversed_data = data[::-1]
print(reversed_data)
# b'\xc0\xde\xde\xad'
Python 3:
bytes(reversed(b'\xAD\xDE\xDE\xC0'))
# b'\xc0\xde\xde\xad'
Python has a slice notation to reverse the values of a list: nameOfList[::-1]
So I might store the hex values as strings, put them into a list, and then try something like:
def reverseList(aList):
    rev = aList[::-1]
    outString = ""
    for el in rev:
        outString += el + " "
    return outString
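One more sketch, not from the answers above, that swaps adjacent bytes across the whole 16 MB file in one pass (it assumes the file length is even):
# Swap adjacent bytes (AB CD -> CD AB) across an entire file
with open('file.bin', 'rb') as f:
    data = bytearray(f.read())

# exchange the even- and odd-indexed bytes with slice assignment
data[::2], data[1::2] = data[1::2], data[::2]

with open('reversed.bin', 'wb') as f:
    f.write(data)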

export matlab variable to text for python usage

So let's start off by saying I'm a total beginner in MATLAB. I'm working with Python, and now I've received some data in a MATLAB file that I need to export to a format I can use with Python.
I've googled around and found I can export a MATLAB variable to a text file using:
dlmwrite('my_text', MyVariable, 'delimiter' , ',');
Now, the variable I need to export is a 16000 x 4000 matrix of doubles of the form 0.006747668446927, and here is where the problem starts: I need to export the full value of each double. Trying with that function led me to export the numbers in the format 0.0067477. That won't do, since I need a whole lot more precision for what I'm doing. So how can I export the full values of each of these variables? Or, if you have a more elegant way of using that huge MATLAB matrix in Python, please feel free to share.
Regards,
Bogdan
To exchange big chunks of numerical data between Python and MATLAB, I recommend HDF5:
http://en.wikipedia.org/wiki/Hierarchical_Data_Format
The Python binding is called h5py:
http://code.google.com/p/h5py
Here are two examples, one for each direction. First from MATLAB to Python:
% matlab
points = [1 2 3 ; 4 5 6 ; 7 8 9 ; 10 11 12 ];
hdf5write('test.h5', '/Points', points);
# python
import h5py
with h5py.File('test.h5', 'r') as f:
    points = f['/Points'][()]  # .value was removed in h5py 3.0
And now from Python to MATLAB:
# python
import h5py
import numpy
points = numpy.array([[1., 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
with h5py.File('test.h5', 'w') as f:
    f['/Points'] = points
% matlab
points = hdf5read('test.h5', '/Points');
NOTE: A column in MATLAB will come out as a row in Python and vice versa. This isn't a bug, but the difference between the way C and Fortran order a contiguous piece of data in memory (row-major vs. column-major).
Scipy has tools for reading MATLAB .mat files natively: see e.g. http://www.janeriksolem.net/2009/05/reading-and-writing-mat-files-with.html
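A short sketch of that route (the filename and variable name are placeholders; scipy.io.loadmat reads .mat files saved as v7 or earlier, while v7.3 files are HDF5 and would need h5py):
from scipy.io import loadmat

# Load the full-precision matrix straight from the .mat file
mat = loadmat('my_data.mat')    # placeholder filename
MyVariable = mat['MyVariable']  # e.g. a 16000 x 4000 float64 numpy array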
While I like the HDF5-based answer, I still think text files and CSVs are nice for smaller things (you can open them in text editors, spreadsheets, whatever). In that case I would use MATLAB's fopen/fprintf/fclose rather than dlmwrite; I like to make things explicit. Then again, dlmwrite might be better for multi-dimensional arrays.
You can simply write your variable to a file as binary data, then read it in any language you want, be it MATLAB, Python, C, etc. Example:
MATLAB (write)
X = rand([100 1],'single');
fid = fopen('file.bin', 'wb');
count = fwrite(fid, X, 'single');
fclose(fid);
MATLAB (read)
fid = fopen('file.bin', 'rb');
data = fread(fid, Inf, 'single=>single');
fclose(fid);
Python (read)
import struct

data = []
with open("file.bin", "rb") as f:
    # read 4 bytes at a time (one float)
    chunk = f.read(4)  # returns a bytes object
    while chunk:
        # byte sequence to float
        num = struct.unpack('f', chunk)[0]
        # append to list
        data.append(num)
        # read the next 4 bytes
        chunk = f.read(4)

# print list
print(data)
C (read)
#include <stdio.h>
#include <stdlib.h>

int main()
{
    FILE *fp = fopen("file.bin", "rb");

    // Determine size of file
    fseek(fp, 0, SEEK_END);
    long int lsize = ftell(fp);
    rewind(fp);

    // Allocate memory, and read file
    float *numbers = (float*) malloc(lsize);
    size_t count = fread(numbers, 1, lsize, fp);
    fclose(fp);

    // print data
    int i;
    int numFloats = lsize / sizeof(float);
    for (i = 0; i < numFloats; i += 1) {
        printf("%g\n", numbers[i]);
    }

    return 0;
}
