np.fromfile with count=-1 adds unexpected zeros - python

I am trying to use np.fromfile in order to read a binary file that I have written with Fortran using direct access. However, if I set count=-1 instead of the exact number of items, np.fromfile returns a larger array than expected, appending zeros to the vector I wrote in binary.
Fortran test code:
program testing
implicit none
integer*4::i
open(1,access='DIRECT', recl=20, file='mytest', form='unformatted',convert='big_endian')
write(1,rec=1) (i,i=1,20)
close(1)
end program
How I am using np.fromfile:
import numpy as np
f=open('mytest','rb')
f.seek(0)
x=np.fromfile(f,'>i4',count=20)
print len(x),x
So if I use it like this, it returns exactly my [1,...,20] np array, but setting count=-1 returns [1,...,20,0,0,0,0,0] with a size of 1600.
I am using a little endian machine (shouldn't affect anything) and I am compiling the Fortran code with ifort.
I am just curious about the reason this happens, to avoid any surprises in the future.
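One way to see where the extra elements come from (a diagnostic sketch I'm adding here, not part of the original question) is to compare the file size on disk with the number of bytes the write should have produced. count=-1 just makes np.fromfile read until end of file, so if the file is larger than the 80 bytes of payload, everything beyond it shows up as extra (zero) elements:
import os
import numpy as np

expected_bytes = 20 * 4                     # twenty 4-byte integers
actual_bytes = os.path.getsize('mytest')    # what the Fortran run actually wrote
print(expected_bytes, actual_bytes)

with open('mytest', 'rb') as f:
    x = np.fromfile(f, '>i4', count=-1)     # count=-1 reads to end of file
print(len(x))                               # i.e. actual_bytes // 4 elements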

Related

trouble saving numpy array to matlab readable file

I have an image sequence as a numpy array;
Mov (15916, 480, 768)
dtype = int16
I've tried using Mov.tofile(filename)
this saves the array and I can load it again in python and view the images.
In matlab the images are corrupted after about 3000 frames.
Using the following also works but has the same problem when I retrieve the images in matlab;
fp = np.memmap(sbxpath, dtype='int16', mode='w+', shape=Mov.shape)
fp[:,:,:] = Mov[:,:,:]
If I use:
mv['mov'] = Mov
sio.savemat(sbxpath, mv)
I get the following error;
OverflowError: Python int too large to convert to C long
what am I doing wrong?
I'm sorry for this, because it is a beginner's problem. Python saves variables as integers or floats depending on how they are initialized, while Matlab defaults to 8-byte doubles. My Matlab script expected doubles, but my Python script was outputting all kinds of variable types, so naturally things got messed up.
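To make that point concrete, here is a small illustration (my addition, not from the original answer) of how numpy picks a dtype from the way values are created, while MATLAB assumes double unless told otherwise:
import numpy as np

print(np.array([1, 2, 3]).dtype)        # int64 on most 64-bit builds, int32 on some (e.g. Windows)
print(np.array([1.0, 2.0, 3.0]).dtype)  # float64, i.e. MATLAB's default double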

Read a float binary file into 2D arrays in python and matlab

I have some binary input files (extension ".bin") that describe a 2D field of ocean depth and which are all negative float numbers. I have been able to load them in matlab as follows:
f = fopen(filename,'r','b');
data = reshape(fread(f,'float32'),[128 64]);
This matlab code gives me double values between 0 and -5200. However, when I try to do the same in Python, I strangely get values between 0 and 1e-37. The Python code is:
f = open(filename, 'rb')
data = np.fromfile(f, np.float32)
data.shape = (64,128)
The strange thing is that there is a mask value of 0 for land which shows up in the right places in the (64,128) array in both cases. It seems to just be the magnitude and sign of the numpy.float32 values that are off.
What am I doing wrong in the Python code?
numpy.fromfile isn't platform independent; in particular, the byte order is mentioned in the documentation:
Do not rely on the combination of tofile and fromfile for data storage, as the binary files generated are not platform independent. In particular, no byte-order or data-type information is saved.
You could try:
data = np.fromfile(f, '>f4') # big-endian float32
and:
data = np.fromfile(f, '<f4') # little-endian float32
and check which one (big endian or little endian) gives the correct values.
Based on your matlab fopen, the file is big endian ('b'), but your Python code does not take care of the endianness.
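Putting the two points together, a minimal sketch (reusing filename and the shape from the question) would be:
import numpy as np

with open(filename, 'rb') as f:
    data = np.fromfile(f, '>f4')  # '>f4' = big-endian 4-byte float, matching fopen(...,'b')
data = data.reshape(64, 128)      # same shape the original Python code used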

numpy.fromfile differences between python 2.7.3 and 2.7.6

I'm having an issue running code between two consoles and I've gotten it down to a difference between the versions of python installed on these computers (2.7.3 and 2.7.6 respectively).
Here is the input file found on github (https://github.com/tkkanno/PhD_work/blob/master/1r).
When run on Python 2.7.3 with numpy 1.11.1, the following code works as expected:
import numpy as np
s = 'directory/to/file'
f = open(s, 'rb')
y = np.fromfile(f,'<l')
y.shape
This gives a numpy array of shape (16384,). However, when it is run on Python 2.7.6/numpy 1.11.1 it gives an array half the size, (8192,). This isn't acceptable for me.
I can't understand why numpy is acting this way with different versions of Python. I would be grateful for any suggestions.
Converted from my comment:
You're likely running on different Python/OS builds with different notions of how big a long is. C doesn't require a specific long size, and in practice, Windows always treats it as 32 bits, while other common desktop OSes treat it as 32 bits if the OS & Python are built for 32 bit CPUs (ILP32), and 64 bits if built for 64 bit CPUs (LP64).
If you want a fixed width type on all OSes, don't use the system-dependent-width types; use fixed width types instead. From your comments, the expected behavior is to load 32 bit/4 byte values. If you were just using native endianness, you could simply pass numpy.int32 (numpy accepts the type object itself as a dtype). Since you want to specify endianness explicitly (perhaps this might run on a big endian system), you can instead pass '<i4', which explicitly states it's a little endian (<) signed integer (i) four bytes in size (4):
import numpy as np
s = 'directory/to/file'
with open(s, 'rb') as f: # Use with statements when opening files
    y = np.fromfile(f, '<i4') # Use explicit fixed width type
y.shape
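If you want to confirm this on a given machine, a quick check (my addition, not from the original answer) is to compare the item sizes of the platform-dependent and fixed-width dtypes:
import numpy as np

print(np.dtype('l').itemsize)    # size of a C long: 4 on ILP32/Windows builds, 8 on LP64 builds
print(np.dtype('<i4').itemsize)  # always 4, regardless of platform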

Unformatted input/output for Fortran

I'm trying to use FortranFile to get an output that I can use in my simulation code written in F95. I'm having trouble getting fortranfile to work properly. Maybe it's because I do not understand how it works. Here's my problem:
If I want to write a 1D array using FortranFile, it works fine:
nx = 128
bxo = np.zeros(nx, dtype=float)
bxo = something
import fortranfile as fofi
bxf=fofi.FortranFile('Fbx.dat',mode='w')
bxf.writeReals(bxo,prec='d')
bxf.close()
The above 1D version works like a charm. As soon as I try to do it for a 2D array, I get problems
nx = 128; ny = 128
bxo = np.zeros((nx,ny), dtype=float)
bxo = something
import fortranfile as fofi
bxf=fofi.FortranFile('Fbx.dat',mode='w')
bxf.writeReals(bxo,prec='d')
bxf.close()
When I try to do this, I get the following error:
---------------------------------------------------------------------------
error Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/7.3/lib/python2.7/site-packages/IPython/utils/py3compat.py in execfile(fname, *where)
173 else:
174 filename = fname
--> 175 __builtin__.execfile(filename, *where)
/Users/parashar/Dropbox/sandbox/test2d.py in <module>()
130 vyf=fofi.FortranFile('Fvy.dat',mode='w')
131 vzf=fofi.FortranFile('Fvz.dat',mode='w')
--> 132 bxf.writeReals(bxo,prec='d')
133 byf.writeReals(byo,prec='d')
134 bzf.writeReals(bzo,prec='d')
/Users/parashar/Dropbox/sandbox/fortranfile.py in writeReals(self, reals, prec)
215 _fmt = self.ENDIAN + prec
216 for r in reals:
--> 217 self.write(struct.pack(_fmt,r))
218 self._write_check(length_bytes)
219
error: required argument is not a float
Any ideas what could be going on?
Thanks!
I like Emmet's suggestion but given that my knowledge of python is very basic, it would be a lot of effort for a short goal. I just realized that I could handle the situation in a slightly different manner.
Direct access files in Fortran do not have the unnecessary leading/trailing information that the usual unformatted Fortran files do. So the simplest way to deal with unformatted data exchanged between Fortran and Python is to treat the files as direct access in Fortran. Here is an example of how we can exchange data between Python and Fortran.
Python code:
import numpy as np
nx=128; ny=128;
bxo=np.zeros((nx,ny),dtype=float)
bxo=something
bxf=open('Fbx.dat',mode='wb')
np.transpose(bxo).tofile(bxf) # We transpose the array to map indices
# from python to fortran properly
bxf.close()
Fortran code:
program test
implicit none
double precision, dimension(128,128) :: bx
integer :: NNN, i, j
inquire(iolength=NNN) bx
open(unit=23,file='Fbx.dat',form='unformatted',status='old',&
access='direct',recl=NNN)
read(23,rec=1) bx
close(23)
! Write it out to a text file to test it
! by plotting in gnuplot (unit 24 and the file name 'Fbx.txt' are arbitrary choices)
open(unit=24,file='Fbx.txt',status='replace')
do i=1,128; do j=1,128
write(24,*) i,j,bx(i,j)
enddo; enddo
close(24)
end
Because we are using the standard binary format to read/write data, this method will work with arrays of any size, unlike the FortranFile method.
I have realized that by sticking to direct access files in Fortran, we can have wider compatibility with other languages, e.g. Python, IDL, etc. This way we do not have to worry about the weird leading/trailing record markers, endianness, etc.
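For completeness, the reverse direction is just as simple. A sketch (under the same assumptions as above: 128x128 native-endian doubles in 'Fbx.dat') for reading the direct access file back into Python:
import numpy as np

bx = np.fromfile('Fbx.dat', dtype=np.float64, count=128*128)
bx = bx.reshape(128, 128).T  # transpose back, mirroring the np.transpose() used when writing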
I hope this will help someone else in my situation too.
I've never had to do anything with Fortran and Python at the same time, and know absolutely nothing about fortranfile.py, but my best guess is that fortranfile is not numpy-aware.
When you use 1D numpy arrays, or Python arrays, lists, etc. the iteration in the last part of the stack trace suggests that it's expecting an iterable of numbers ("for r in reals"), whereas when you try to serialize a 2D numpy array, I'm not sure what it gets (i.e. what "r" ends up being), maybe an iterable of iterators, or an iterable of 1D arrays. In short, 'r' is not just the number that's expected ("required argument is not a float"), but something else (like a 1D array, list, etc.).
I would have a look and see if there's an alternative to writeReals() in fortranfile and, if not, hack one in that can handle 2D arrays with a bit of copy-pasting.
I would start by putting in a diagnostic print before that "self.write()" line (217) that tells you what 'r' actually is, since it isn't the expected float.
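If that diagnosis is right, one untested workaround (my sketch, not from the original answer) would be to flatten the 2D array before handing it to writeReals, so the loop inside fortranfile sees plain scalars again:
# Flatten in Fortran order so the index layout matches what the Fortran reader expects
bxf.writeReals(bxo.ravel(order='F'), prec='d')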

Write ascii output from numpy arrays

I want to write some random numbers into an ascii output file.
I generate the numbers with numpy, so the numbers are stored in numpy arrays.
import numpy as np
random1=np.random.uniform(-1.2,1.2,7e6)
random2=...
random3=...
All three array are of the same size.
I used standard file output, but this is really slow: just about 8000 lines per 30 min. This may be because I loop over three large arrays, though.
fout1 = open("output.dat","w")
for i in range(len(random1)):
    fout1.write(str(random1[i])+"\t"+ str(random2[i])+"\t"+ str(random3[i])+"\n")
fout1.close()
I also just used print str(random1[i])+"\t"+ str(random2[i])+"\t"+ str(random3[i]) and dumped everything into a file using the shell (./myprog.py > output.dat), which seems a bit faster, but I am still not satisfied with the output speed.
Any recommendations are really welcome.
Have you tried
random = np.vstack((random1, random2, random3)).T
np.savetxt("output.dat", random, delimiter="\t")
I'm guessing the disk I/O is the most expensive operation you are doing. You could try to create your own buffer to deal with this: instead of writing every line on every loop iteration, buffer up, say, 100 lines and write them in one big block. Then experiment and see what the most beneficial buffer size is.
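A rough sketch of that buffering idea (the chunk size of 100 is just a starting point to experiment with):
chunk = 100
with open("output.dat", "w") as fout:
    for start in range(0, len(random1), chunk):
        stop = start + chunk
        # Build the whole block of lines in memory, then write it in one call
        lines = ["%s\t%s\t%s\n" % (a, b, c)
                 for a, b, c in zip(random1[start:stop],
                                    random2[start:stop],
                                    random3[start:stop])]
        fout.write("".join(lines))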
