How to read a compressed binary file as an array of floats - python

I need to read a compressed unformatted binary file in as an array of floats. The only way that I have found to do this is to use os to unzip the file, read it using np.fromfile, and then zip it up again:
import os
import numpy as np

os.system('gunzip filename.gz')
array = np.fromfile('filename', 'f4')
os.system('gzip filename')
However, this is not acceptable. Apart from being messy, I also need to read files for which I don't have write permission. I understand that np.fromfile cannot directly read a compressed file. I have found people recommending that I use this:
f=gzip.GzipFile('filename')
file_content = f.read()
But this returns something like this: '\x00\x80\xe69\x00\x80\xd19\x00\x80'
instead of an array of floats. Does anyone know how to convert this output into an array of floats, or have a better way to do this?

After you've read the file content, you can get an array with numpy.frombuffer (numpy.fromstring was the traditional suggestion here, but it is deprecated for binary data):
import gzip
import numpy as np

f = gzip.GzipFile('filename')
file_content = f.read()
array = np.frombuffer(file_content, dtype='f4')
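Note that np.frombuffer returns a read-only view when it is backed by an immutable bytes object; call .copy() if you need to modify the result. A minimal sketch using context managers ('filename.gz' is an illustrative path):
import gzip
import numpy as np

# gzip.open decompresses transparently; the with-block closes the file for us
with gzip.open('filename.gz', 'rb') as f:
    file_content = f.read()

# .copy() yields a writable array; frombuffer alone is a read-only view
array = np.frombuffer(file_content, dtype='f4').copy()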

Related

Efficient Data Serialization format for list of arrays in Python

I have a large list of arrays (data type float32, with a few instances of int) in Python that is to be serialized via UTF-8 encoding and saved on a server. However, I'm having issues with the size of the saved file exceeding storage limits.
The available serialization formats that the server can handle are: string, bytes, JSON, and XML. Which of these formats would be best for saving the data structure?
I think you should also explore using h5 files. I'd honestly just convert the list of lists into a NumPy array, and then save that array to an h5 file.
You can save it like this:
import numpy as np
import h5py
nparr = np.asanyarray(yourArr)
h5f = h5py.File('listData.h5', 'w')
h5f.create_dataset('dataset_1', data=nparr)
h5f.close()
You can load it like this:
h5f = h5py.File('listData.h5','r')
loadData = h5f['dataset_1'][:]
h5f.close()
H5 files are optimized for storing matrix-like data.
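Since the original problem is a storage limit, it may also be worth enabling h5py's built-in gzip compression when you create the dataset. A minimal sketch reusing the array from above (the file name and compression level are illustrative):
import numpy as np
import h5py

nparr = np.asanyarray(yourArr)
# gzip compression is built into HDF5; levels run 0-9, and 4 is a common default
with h5py.File('listData_compressed.h5', 'w') as h5f:
    h5f.create_dataset('dataset_1', data=nparr,
                       compression='gzip', compression_opts=4)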
If you need some helper scripts, try out some I made here (shameless plug):
https://github.com/ss4328/h5_manager_scripts
If h5 files are not your thing, JSON is the best option. There are tons of free libraries to generate/use the data, it's pretty sleek on the web, and not to mention it's human-readable too. XML is a bit outdated.

Trying to size down HDF5 File by changing index field types using h5py

I have a very large CSV File (~12Gb) that looks something like this:
posX,posY,posZ,eventID,parentID,clockTime
-117.9853515625,60.2998046875,0.29499998688697815,0,0,0
-117.9853515625,60.32909393310547,0.29499998688697815,0,0,0
-117.9560546875,60.2998046875,0.29499998688697815,0,0,0
-117.9560546875,60.32909393310547,0.29499998688697815,0,0,0
-117.92676544189453,60.2998046875,0.29499998688697815,0,0,0
-117.92676544189453,60.32909393310547,0.29499998688697815,0,0,0
-118.04051208496094,60.34012985229492,4.474999904632568,0,0,0
-118.04051208496094,60.36941909790039,4.474999904632568,0,0,0
-118.04051208496094,60.39870834350586,4.474999904632568,0,0,0
I want to convert this CSV file into the HDF5 format using the library h5py, while also lowering the total file size by setting the field/index types, e.g.:
Save posX, posY and posZ as float32. Save eventID, parentID and clockTime as int32, or something along those lines.
Note: I need to chunk the data in some form when I read it in to avoid memory errors.
However, I am unable to get the desired result. What I have tried so far:
Using Pandas own methods following this guide: How to write a large csv file to hdf5 in python?
This creates the file, but I'm somehow unable to change the types, and the file remains too big (~10.7Gb). The field types are float64 and int64.
I also tried splitting the CSV up into parts (using split -n x myfile.csv) before working with the increments. I ran into some data errors at the beginning and end of each file, which I was able to fix by removing said lines using sed. Then I tried out the following code:
import pandas as pd
import h5py

PATH_csv = "/home/MYNAME/Documents/Workfolder/xaa"  # xaa is my csv increment
DATA_csv = pd.read_csv(PATH_csv)
with h5py.File("pct_data-hdf5.h5", "a") as DATA_hdf:
    dset = DATA_hdf.create_dataset("posX", data=DATA_csv["posX"], dtype="float32")
Sadly, this created the file and the table but didn't write any data into it.
Expectation
Creating an HDF5 file containing the data of a large CSV file while also changing the variable type of each index.
If something is unclear, please ask me for clarification. I'm still a beginner!
Have you considered the numpy module?
It has a handy function (genfromtxt) to read CSV data with headers into a NumPy array. You define the dtype. The array is suitable for loading into HDF5 with h5py's create_dataset() function.
See code below. I included 2 print statements. The first shows the dtype names created from the CSV headers. The second shows how you can access the data in the numpy array by field (column) name.
import h5py
import numpy as np

PATH_csv = 'SO_55576601.csv'
csv_dtype = ('f8', 'f8', 'f8', 'i4', 'i4', 'i4')

csv_data = np.genfromtxt(PATH_csv, dtype=csv_dtype, delimiter=',', names=True)
print(csv_data.dtype.names)
print(csv_data['posX'])

# the with-block closes the file, so no separate h5f.close() is needed
with h5py.File('SO_55576601.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', data=csv_data)
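Note that genfromtxt reads the whole CSV at once, which may not fit in memory for a ~12Gb file. One way to honor the chunking requirement is to combine pandas' chunksize with a resizable h5py dataset; a sketch under those assumptions (the file names, chunk size, and exact dtypes are illustrative):
import h5py
import pandas as pd

PATH_csv = 'SO_55576601.csv'
col_dtypes = {'posX': 'f4', 'posY': 'f4', 'posZ': 'f4',
              'eventID': 'i4', 'parentID': 'i4', 'clockTime': 'i4'}

with h5py.File('SO_55576601_chunked.h5', 'w') as h5f:
    dset = None
    # read the CSV a million rows at a time to keep memory bounded
    for chunk in pd.read_csv(PATH_csv, dtype=col_dtypes, chunksize=1_000_000):
        recs = chunk.to_records(index=False)
        if dset is None:
            # resizable dataset (maxshape=None) so later chunks can be appended
            dset = h5f.create_dataset('CSV_data', data=recs,
                                      maxshape=(None,), chunks=True)
        else:
            dset.resize(dset.shape[0] + recs.shape[0], axis=0)
            dset[-recs.shape[0]:] = recs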

save binary file after reading it using numpy

I have used the following code to parse a binary file using numpy. After reading the binary data, I want to save it somewhere, because I want to use the extracted data for another purpose. I am trying to use np.save, but it is not saving anywhere. How do I save this data?
import numpy as np

with open(r'file_path', 'rb') as f:
    data = np.fromfile(f, dtype=np.int32, count=-1)

print(data)
np.save(filenum, data)
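A likely cause, assuming the snippet above is complete: filenum is never defined, so np.save fails before anything is written. np.save takes a file name (it appends .npy to a bare name) or an open file object as its first argument. A minimal sketch with an illustrative output path:
import numpy as np

with open(r'file_path', 'rb') as f:
    data = np.fromfile(f, dtype=np.int32, count=-1)

np.save('extracted_data', data)            # writes extracted_data.npy
restored = np.load('extracted_data.npy')   # round-trip check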

Two implementations of Numpy fromfile?

I am trying to update some legacy code that uses np.fromfile in a method. When I search the numpy source for this function I only find np.core.records.fromfile, but when you search the docs you can find np.fromfile. Looking at these two, you can see they have different kwargs, which makes me feel like they are different functions altogether.
My questions are:
1) Where is the source for np.fromfile located?
2) Why are there two different functions under the same name? This can clearly get confusing if you aren't careful about the difference, as the two behave differently. Specifically, np.core.records.fromfile will raise an error if you try to read more bytes than a file contains, while np.fromfile does not. You can find a minimal example below.
In [1]: import numpy as np
In [2]: my_bytes = b'\x04\x00\x00\x00\xac\x92\x01\x00\xb2\x91\x01'
In [3]: with open('test_file.itf', 'wb') as f:
            f.write(my_bytes)
In [4]: with open('test_file.itf', 'rb') as f:
            result = np.fromfile(f, 'int32', 5)
In [5]: result
Out[5]: array([     4, 103084], dtype=int32)
In [6]: with open('test_file.itf', 'rb') as f:
            result = np.core.records.fromfile(f, 'int32', 5)
ValueError: Not enough bytes left in file for specified shape and type
If you use help on np.fromfile you will find something very... helpful:
Help on built-in function fromfile in module numpy.core.multiarray:
fromfile(...)
fromfile(file, dtype=float, count=-1, sep='')
Construct an array from data in a text or binary file.
A highly efficient way of reading binary data with a known data-type,
as well as parsing simply formatted text files. Data written using the
`tofile` method can be read using this function.
As far as I can tell, this is implemented in C and can be found here.
If you are trying to save and load binary data, you shouldn't use np.fromfile anymore. You should use np.save and np.load which will use a platform-independent binary format.

Writing Fortran unformatted files with Python

I have some single-precision little-endian unformatted data files written by Fortran77. I am reading these files in Python using the following commands:
import numpy as np
original_data = np.dtype('float32')
f = open(file_name,'rb')
original_data = np.fromfile(f,dtype='float32',count=-1)
f.close()
After some data manipulation in Python, I am trying to write them back in the original format using the following commands:
import struct

out_file = open(output_file, "wb")
s = struct.pack('f' * len(manipulated_data), *manipulated_data)
out_file.write(s)
out_file.close()
But it doesn't seem to be working. Any ideas on the right way to write the data back in the original Fortran unformatted format from Python?
Details of the problem:
I am able to read the final file with manipulated data from Fortran. However, I want to visualize these data using software (Paraview). For this I convert the unformatted data files into the *.h5 format. I am able to convert both the original and the manipulated data into h5 format using h5 utilities. But while Paraview is able to read the *.h5 files created from the original data, it is not able to read the *.h5 files created from the manipulated data. I am guessing something is being lost in translation.
This is how I am opening the file written by Python in Fortran (single precision data):
open (in_file_id,FILE=in_file,form='unformatted',access='direct',recl=4*n*n*n)
And this is how I am writing the original unformatted data from Fortran:
open(out_file_id,FILE=out_file,form="unformatted")
Is this information sufficient?
Have you tried using the .tofile method of the manipulated data array? It will write the array in C order but is capable of writing plain binary.
The documentation for .tofile also suggests this is the same as the following (in modern NumPy, tostring has been renamed tobytes):
with open(outfile, 'wb') as fout:
    fout.write(manipulated_data.tostring())
This is creating an unformatted sequential access file:
open(out_file_id,FILE=out_file,form="unformatted")
Assuming you are writing a single array real a(n,n,n) using simply write(out_file_id) a, you should see a file of size 4*n^3+8 bytes. The extra 8 bytes are a 4-byte integer (=4n^3) repeated at the start and end of the record.
The second form:
open (in_file_id,FILE=in_file,form='unformatted',access='direct',recl=4*n*n*n)
opens direct access, which does not have those headers. For writing you'd now have write(unit,rec=1) a. If you read your sequential access file using direct access, it will read without error, but you'll get that integer header read as a float (garbage) as the (1,1,1) array value, and then everything else is shifted. You say you can read it with Fortran, but are you checking that you are really reading what you expect?
The best fix is to change your original Fortran code to use unformatted, direct access for both reading and writing. This gives you an 'ordinary' raw binary file with no headers.
Alternately, in your Python you need to first read that 4-byte integer, then your data. On output you can put the integer headers back or not, depending on what your Paraview filter is expecting.
Here is Python to read, modify, and write an unformatted sequential Fortran file containing a single record:
import struct
import numpy as np

f = open('infile', 'rb')
recl = struct.unpack('i', f.read(4))[0]          # leading record marker
numval = recl // np.dtype('float32').itemsize    # number of float32 values in the record
data = np.fromfile(f, dtype='float32', count=numval)
endrec = struct.unpack('i', f.read(4))[0]        # trailing record marker
if endrec != recl:
    print("error: unexpected end record")
f.close()

for i in range(0, len(data)):
    data[i] = data[i]**2                         # example data modification

f = open('outfile', 'wb')
f.write(struct.pack('i', recl))
data.tofile(f)
f.write(struct.pack('i', recl))
f.close()
Just loop for multiple records. Note that the data here is read as a vector and assumed to be all floats; of course, you need to know the actual data type to make use of it.
Also be aware that you may need to deal with byte-order issues, depending on the platform.
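As an aside, if SciPy is available, scipy.io.FortranFile handles these record markers for you. A minimal sketch under that assumption (file names are illustrative):
import numpy as np
from scipy.io import FortranFile

# read one record of float32 data; record markers are handled internally
f = FortranFile('infile', 'r')
data = f.read_reals(dtype='float32')
f.close()

data = data**2  # example data modification

# write the result back as a single unformatted sequential record
f = FortranFile('outfile', 'w')
f.write_record(data.astype('float32'))
f.close()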
