I have used the following code to parse a binary file using numpy. After reading the binary data, I want to save it somewhere, because I want to use the extracted data for another purpose. I am trying to use np.save, but it is not saving anywhere. How do I save this data?
import numpy as np

with open(r'file_path', 'rb') as f:
    data = np.fromfile(f, dtype=np.int32, count=-1)
print(data)
np.save(filenum, data)
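np.save wants a filename (or an open file object) as its first argument; if filenum is not a valid path string, the call cannot know where to write. A minimal sketch of the save-and-reload round trip, assuming a hypothetical output path 'data_out.npy':

import numpy as np

with open(r'file_path', 'rb') as f:
    data = np.fromfile(f, dtype=np.int32, count=-1)

# Save to an explicit .npy file ('data_out.npy' is a hypothetical name);
# np.save appends the .npy suffix automatically if it is missing.
np.save('data_out.npy', data)

# Reload the array later for the other purpose.
data_again = np.load('data_out.npy')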
I have a very large CSV File (~12Gb) that looks something like this:
posX,posY,posZ,eventID,parentID,clockTime
-117.9853515625,60.2998046875,0.29499998688697815,0,0,0
-117.9853515625,60.32909393310547,0.29499998688697815,0,0,0
-117.9560546875,60.2998046875,0.29499998688697815,0,0,0
-117.9560546875,60.32909393310547,0.29499998688697815,0,0,0
-117.92676544189453,60.2998046875,0.29499998688697815,0,0,0
-117.92676544189453,60.32909393310547,0.29499998688697815,0,0,0
-118.04051208496094,60.34012985229492,4.474999904632568,0,0,0
-118.04051208496094,60.36941909790039,4.474999904632568,0,0,0
-118.04051208496094,60.39870834350586,4.474999904632568,0,0,0
I want to convert this CSV file into the HDF5 format using the library h5py, while also lowering the total file size by setting the field/column types, e.g.:
Save posX, posY and posZ as float32. Save eventID, parentID and clockTime as int32, or something along those lines.
Note: I need to chunk the data in some form when reading it in, to avoid memory errors.
However, I am unable to get the desired result. What I have tried so far:
Using pandas' own methods, following this guide: How to write a large csv file to hdf5 in python?
This creates the file, but I'm somehow unable to change the types, and the file remains too big (~10.7 GB). The field types are float64 and int64.
I also tried splitting the CSV up into parts (using split -n x myfile.csv) before working with the increments. I ran into some data errors at the beginning and end of each file, which I was able to fix by removing the affected lines with sed. Then I tried the following code:
import pandas as pd
import h5py

PATH_csv = "/home/MYNAME/Documents/Workfolder/xaa"  # xaa is my csv increment
DATA_csv = pd.read_csv(PATH_csv)

with h5py.File("pct_data-hdf5.h5", "a") as DATA_hdf:
    dset = DATA_hdf.create_dataset("posX", data=DATA_csv["posX"], dtype="float32")
Sadly, this created the file and the dataset but didn't write any data into it.
Expectation
Creating an HDF5 file containing the data of a large CSV file, while also changing the variable type of each column.
If something is unclear, please ask me for clarification. I'm still a beginner!
Have you considered the numpy module?
It has a handy function (genfromtxt) to read CSV data with headers into a NumPy array. You define the dtype, and the resulting array is suitable for loading into HDF5 with h5py's create_dataset() function.
See code below. I included 2 print statements. The first shows the dtype names created from the CSV headers. The second shows how you can access the data in the numpy array by field (column) name.
import h5py
import numpy as np

PATH_csv = 'SO_55576601.csv'
# 'f4'/'i4' give the float32/int32 sizes requested in the question
csv_dtype = ('f4', 'f4', 'f4', 'i4', 'i4', 'i4')

csv_data = np.genfromtxt(PATH_csv, dtype=csv_dtype, delimiter=',', names=True)
print(csv_data.dtype.names)
print(csv_data['posX'])

with h5py.File('SO_55576601.h5', 'w') as h5f:
    dset = h5f.create_dataset('CSV_data', data=csv_data)
# no explicit close needed; the with block closes the file
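One caveat: genfromtxt reads the whole file into memory, which may not be feasible for a ~12 GB CSV. A sketch of one way to honor the chunking requirement, using pandas to read the CSV in pieces and appending each piece to a resizable h5py dataset (the filenames 'myfile.csv' and 'myfile.h5' and the chunk size are assumptions, not from the original posts):

import h5py
import numpy as np
import pandas as pd

# Compound row type: float32 positions, int32 IDs/clock time.
row_dtype = np.dtype([('posX', 'f4'), ('posY', 'f4'), ('posZ', 'f4'),
                      ('eventID', 'i4'), ('parentID', 'i4'), ('clockTime', 'i4')])

with h5py.File('myfile.h5', 'w') as h5f:
    # Resizable 1-D dataset of compound rows, grown as chunks arrive.
    dset = h5f.create_dataset('CSV_data', shape=(0,), maxshape=(None,),
                              dtype=row_dtype, chunks=True)
    for chunk in pd.read_csv('myfile.csv', chunksize=1_000_000):
        # Convert the pandas chunk to a structured array with the target dtype.
        rows = np.empty(len(chunk), dtype=row_dtype)
        for name in row_dtype.names:
            rows[name] = chunk[name].to_numpy()
        # Append the chunk at the end of the dataset.
        start = dset.shape[0]
        dset.resize((start + len(rows),))
        dset[start:] = rows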
I need to read a compressed unformatted binary file in as an array of floats. The only way I have found to do this is to use os to unzip the file, read it using np.fromfile, and then zip it up again.
os.system('gunzip filename.gz')
array = np.fromfile('filename','f4')
os.system('gzip filename')
However, this is not acceptable. Apart from being messy, I need to read files for which I don't have write permission. I understand that np.fromfile cannot directly read a compressed file. I have found people recommending that I use this:
f=gzip.GzipFile('filename')
file_content = f.read()
But this returns something like this: '\x00\x80\xe69\x00\x80\xd19\x00\x80'
instead of an array of floats. Does anyone know how to convert this output into an array of floats, or have a better way to do this?
After you've read the file content, you can convert it to an array with numpy.frombuffer (numpy.fromstring does the same, but is deprecated for binary data):

import gzip
import numpy as np

# GzipFile decompresses transparently; read() returns the raw bytes.
with gzip.GzipFile('filename') as f:
    file_content = f.read()

array = np.frombuffer(file_content, dtype='f4')
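A caveat worth adding (not part of the original answer): frombuffer returns a read-only view over the bytes object, so copy it if the array needs to be modified:

array = np.frombuffer(file_content, dtype='f4').copy()  # writable copy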
Is there a single Python command, one each to
write a single Numpy 2D array to a CSV file
read the contents of a CSV file to a single Numpy 2D array
There are csv.reader and csv.writer functions to do the above, but can this be done with a single-line command, as in Octave/MATLAB:
csvwrite('filename.csv', variable);
variable = csvread('filename.csv');
Check out np.savetxt and its counterpart, np.loadtxt, which will
Save an array to a text file.
and
Load data from a text file
respectively.
By their arguments they can be used to read and write CSV files:
import numpy as np
np.savetxt('fname.csv', variable, delimiter=',')
variable = np.loadtxt('fname.csv', delimiter=',')
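As a quick round-trip illustration (a sketch; the file name and the optional fmt argument, which controls how the floats are printed, are my own choices):

import numpy as np

variable = np.array([[1.5, 2.0], [3.0, 4.25]])

np.savetxt('fname.csv', variable, delimiter=',', fmt='%.6g')  # write 2-D array as CSV
restored = np.loadtxt('fname.csv', delimiter=',')             # read it back

assert np.allclose(variable, restored)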
I want to load an ARFF file in Python, change some of its values, and then save the changes back to a file. I'm using the LIAC-ARFF package (https://pypi.python.org/pypi/liac-arff). I loaded the ARFF file with the following lines of code:
import arff
data = arff.load(open(FILE_NAME, 'r'))
After manipulating some values inside data, I want to write the data to another ARFF file. Any solution?
Use the following code:
import arff

data = arff.load(open(FILE_NAME, 'r'))   # liac-arff parses text, so open in text mode
f = open(outputfilename, 'w')
arff.dump(data, f)
f.close()
The LIAC-ARFF description presents the dump method as serializing to a file, but that wording is misleading: it just writes the object out as a text file. 'Serialize' suggests saving the whole object, in which case the output file would be binary, not text.
We can also load ARFF data into Python using scipy:
from scipy.io import arff
import pandas as pd
data = arff.loadarff('dataset.arff')
df = pd.DataFrame(data[0])
df.head()
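Note that scipy.io.arff only reads ARFF files; loadarff has no writing counterpart, so to write changes back out you still need a package such as liac-arff.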
I have some single-precision little-endian unformatted data files written by Fortran 77. I am reading these files in Python using the following commands:
import numpy as np

f = open(file_name, 'rb')
original_data = np.fromfile(f, dtype='float32', count=-1)
f.close()
After some data manipulation in Python, I am trying to write them back in the original format using the following commands:
import struct

out_file = open(output_file, "wb")
s = struct.pack('f'*len(manipulated_data), *manipulated_data)
out_file.write(s)
out_file.close()
But it doesn't seem to be working. Any ideas on the right way to write the data back from Python in the original Fortran unformatted format?
Details of the problem:
I am able to read the final file with the manipulated data from Fortran. However, I want to visualize these data using a piece of software (ParaView), so I convert the unformatted data files into the *.h5 format. I am able to convert both the original and the manipulated data into h5 format using h5 utilities, but while ParaView can read the *.h5 files created from the original data, it cannot read the *.h5 files created from the manipulated data. I am guessing something is being lost in translation.
This is how I am opening the file written by Python in Fortran (single precision data):
open (in_file_id,FILE=in_file,form='unformatted',access='direct',recl=4*n*n*n)
And this is how I am writing the original unformatted data from Fortran:
open(out_file_id,FILE=out_file,form="unformatted")
Is this information sufficient?
Have you tried using the .tofile method of the manipulated data array? It will write the array in C order but is capable of writing plain binary.
The documentation for .tofile also suggests this is the same as:

with open(outfile, 'wb') as fout:
    fout.write(manipulated_data.tobytes())   # tobytes() is the current name for tostring()
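As a quick illustration (a sketch, assuming a float32 array and a hypothetical file name 'plain.bin'), tofile and fromfile round-trip raw, headerless binary:

import numpy as np

manipulated_data = np.arange(8, dtype='float32')

# Write the raw bytes: no record markers or other headers.
manipulated_data.tofile('plain.bin')

# Read them back; the dtype must be supplied, since none is stored in the file.
restored = np.fromfile('plain.bin', dtype='float32')
assert np.array_equal(manipulated_data, restored)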
This is creating an unformatted sequential-access file:
open(out_file_id,FILE=out_file,form="unformatted")
Assuming you are writing a single array real a(n,n,n) with simply write(out_file_id) a, you should see a file size of 4*n^3+8 bytes. The extra 8 bytes are a 4-byte integer (=4n^3) repeated at the start and end of the record.
The second form:
open (in_file_id,FILE=in_file,form='unformatted',access='direct',recl=4*n*n*n)
opens the file for direct access, which does not have those headers. For writing you would now use write(unit,rec=1) a. If you read your sequential-access file using direct access, it will read without error, but you'll get the integer header read as a float (garbage) as the (1,1,1) array value, and everything else shifted. You say you can read the file with Fortran, but have you checked that you are really reading what you expect?
The best fix is to change your original Fortran code to use unformatted, direct access for both reading and writing. This gives you an 'ordinary' raw binary file with no headers.
Alternatively, in your Python you need to first read that 4-byte integer, then your data. On output you can put the integer headers back or not, depending on what your ParaView filter is expecting.
Here is Python to read/modify/write an unformatted sequential Fortran file containing a single record:
import struct
import numpy as np

# Read the record: 4-byte length marker, data, 4-byte length marker.
f = open('infile', 'rb')
recl = struct.unpack('i', f.read(4))[0]
numval = recl // np.dtype('float32').itemsize   # integer division
data = np.fromfile(f, dtype='float32', count=numval)
endrec = struct.unpack('i', f.read(4))[0]
if endrec != recl:
    print("error: unexpected end-of-record marker")
f.close()

data = data**2  # example data modification

# Write the record back, including the length markers.
f = open('outfile', 'wb')
f.write(struct.pack('i', recl))
data.tofile(f)
f.write(struct.pack('i', recl))
f.close()
Just loop for multiple records. Note that the data here is read as a flat vector and assumed to be all floats; of course you need to know the actual data type to make use of it.
Also be aware that you may need to deal with byte-order issues depending on the platform.
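If byte order does become an issue (for example, little-endian files read on a big-endian machine), both struct and numpy let you state it explicitly. A small sketch, assuming little-endian data as in the question:

import struct
import numpy as np

with open('infile', 'rb') as f:
    # '<i' forces a little-endian 4-byte integer regardless of platform.
    recl = struct.unpack('<i', f.read(4))[0]
    # '<f4' likewise forces little-endian float32 interpretation.
    data = np.fromfile(f, dtype='<f4', count=recl // 4)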