I have a large NumPy float array (~4k x 16k, float64) that I want to store on disk. I am trying to understand the differences between the following compression approaches:
1) Use np.save - Save in .npy format and zip this using GZIP (like in one of the answers to Compress numpy arrays efficiently)
f = gzip.GzipFile("my_file.npy.gz", "w")
numpy.save(f, my_array)
f.close()
I get equivalent file sizes if I do the following as well
numpy.save('my_file',my_array)
check_call(['gzip', os.getcwd()+'/my_file.npy'])
2) Write the array into a binary file using tofile(). Close the file and zip this generated binary file using GZIP.
f = open("my_file","wb")
my_array.tofile(f)
f.close()
with open('my_file', 'rb') as f_in:
    with gzip.open('my_file.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
The above is a workaround for the following code, which does not achieve any compression; according to the GzipFile docs this is expected.
f = gzip.GzipFile("my_file_1.gz", "w")
my_array.tofile(f)
f.close()
Here is my question: the file size using 1) is about 6 times smaller than that using 2). From what I understand of the .npy format, it is exactly the same as a raw binary file, apart from a small header that preserves the array shape and dtype. I don't see any reason why the file sizes should differ so drastically.
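To make the comparison concrete, here is a minimal verification sketch (assuming the files were written as above and the array is float64): it loads both variants back, checks that they hold the same data, and prints the two compressed sizes.
import gzip
import os
import numpy as np
# Approach 1: np.load reads the .npy directly from the file-like GzipFile object.
with gzip.open("my_file.npy.gz", "rb") as f:
    a1 = np.load(f)
# Approach 2: decompress the raw bytes and reinterpret them (shape/dtype must be known).
with gzip.open("my_file.gz", "rb") as f:
    a2 = np.frombuffer(f.read(), dtype=np.float64).reshape(a1.shape)
assert np.array_equal(a1, a2)
print(os.path.getsize("my_file.npy.gz"), os.path.getsize("my_file.gz"))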
How can I save a ragged tensor as a file on disk and then reuse it in calculations, opening it from the disk? The tensor consists of a nested array of numbers with 4 digits after the decimal point. (I'm working in Google Colab and using Google Drive to save my files; I only know a little bit of Python.)
Here is my data:
I take the column "sim_fasttex", which is a list of lists of different lengths, reshape each of them according to "h" and "w", and collect all these matrices in one list, so finally it is a ragged tensor of shape (number of rows in the initial table, variable length of a matrix, variable height of a matrix).
I don't know your exact context, but you can save any Python object to a file using the pickle module, like this:
import pickle
the_object = object  # placeholder: whatever object you want to save
with open("a_file_name.pkl", "wb") as f:
    pickle.dump(the_object, f)
And later you can load that same object:
import pickle
with open("a_file_name.pkl", "rb") as f:
    the_object = pickle.load(f)
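A sketch of how that might look for the data described above, assuming a pandas DataFrame df with the columns "sim_fasttex", "h" and "w" (the names are taken from the description and are only illustrative):
import pickle
import numpy as np
# Build the ragged list of matrices, one reshaped matrix per row of the table.
# df and the column names are hypothetical stand-ins for your actual data.
ragged = [np.array(values, dtype=np.float32).reshape(h, w)
          for values, h, w in zip(df["sim_fasttex"], df["h"], df["w"])]
with open("ragged.pkl", "wb") as f:
    pickle.dump(ragged, f)
with open("ragged.pkl", "rb") as f:
    ragged = pickle.load(f)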
I have a binary file, several hundred MBs in size. It contains samples in float32 big-endian format (4 bytes per sample). I want to convert them to little-endian format. Some background: I want to write the samples to a .wav file later on, and as far as I know that requires the data to be little-endian.
The below code is what I currently use. It seems to work fine, but is quite slow (I assume because I am writing 4 bytes at a time only):
import struct
infile = "infile_big_endian.raw"
outfile = "outfile_little_endian.raw"
with open(infile, "rb") as old, open(outfile, "wb") as new:
    for chunk in iter(lambda: old.read(4), b""):
        chunk = struct.pack("<f", struct.unpack(">f", chunk)[0])
        new.write(chunk)
Is there a quicker way to do this in Python?
NumPy might be faster:
numpy.memmap(infile, dtype=numpy.int32).byteswap().tofile(outfile)
Or overwriting the input file:
numpy.memmap(infile, dtype=numpy.int32).byteswap(inplace=True).flush()
We memory-map the array and use byteswap to reverse the endianness at C speed. I've used int32 instead of float32 just in case NaNs might be a problem with float32.
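If you prefer to be explicit about the byte order rather than swapping in place, a roughly equivalent sketch (assuming the samples really are 4-byte big-endian floats) reads the file with a big-endian dtype and writes a little-endian copy; unlike the memmap version, this loads the whole file into memory:
import numpy as np
# Read as big-endian float32, convert to little-endian, and write back out.
data = np.fromfile(infile, dtype=">f4")
data.astype("<f4").tofile(outfile)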
I am trying to port this bit of MATLAB code to Python.
MATLAB:
function write_file(im,name)
fp = fopen(name,'wb');
M = size(im);
fwrite(fp,[M(1) M(2) M(3)],'int');
fwrite(fp,im(:),'float');
fclose(fp);
where im is a 3D matrix. As far as I understand, the function writes a binary file with a header of three integers containing the matrix size, followed by im written as a single column of floats. In MATLAB this takes a few seconds for a 150MB file.
Python:
import struct
import numpy as np

def write_image(im, file_name):
    with open(file_name, 'wb') as f:
        l = im.shape[0]*im.shape[1]*im.shape[2]
        header = np.array([im.shape[0], im.shape[1], im.shape[2]])
        header_bin = struct.pack("I"*3, *header)
        f.write(header_bin)
        im_bin = struct.pack("f"*l, *np.reshape(im, (l, 1), order='F'))
        f.write(im_bin)
where im is a numpy array. This code works well, as I compared the output with the binary returned by MATLAB and they are identical. However, for the 150MB file it takes several seconds and tends to drain all the memory (in the linked image I stopped the execution to avoid that, but you can see how the usage builds up!).
This does not make sense to me, as I am running the function on a PC with 15GB of RAM. How can processing a 150MB file require so much memory?
I'd be happy to use a different method, as long as it is still possible to have the two formats: an integer header and a float data column.
There is no need to use struct to save your array. numpy.ndarray has a convenience method for saving itself in binary mode: ndarray.tofile. The following should be much more efficient than creating a gigantic string with the same number of elements as your array:
def write_image(im, file_name):
    with open(file_name, 'wb') as f:
        # int32 header to match MATLAB's 'int'; the image data keeps im's own dtype
        np.array(im.shape, dtype=np.int32).tofile(f)
        im.T.tofile(f)
tofile always saves in row-major C order, while MATLAB uses column-major Fortran order. The simplest way to get around this is to save the transpose of the array. In general, ndarray.T should create a view (wrapper object pointing to the same underlying data) instead of a copy, so your memory usage should not increase noticeably from this operation.
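For completeness, a minimal read-back sketch under the same assumptions (a 3-value int32 header followed by float32 data, as in the MATLAB code):
import numpy as np

def read_image(file_name):
    with open(file_name, 'rb') as f:
        shape = np.fromfile(f, dtype=np.int32, count=3)  # the 3-integer header
        data = np.fromfile(f, dtype=np.float32)          # the flat float payload
    # the data was written column-major (Fortran order), so reshape accordingly
    return data.reshape(tuple(shape), order='F')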
I need to read a compressed unformatted binary file in as an array of floats. The only way that I have found to do this, is to use os to unzip, read it using np.fromfile, and then zip it up again.
os.system('gunzip filename.gz')
array = np.fromfile('filename','f4')
os.system('gzip filename')
However, this is not acceptable: apart from being messy, I also need to read files for which I don't have write permission. I understand that np.fromfile cannot read a compressed file directly. I have found people recommending this:
f=gzip.GzipFile('filename')
file_content = f.read()
But this returns something like this: '\x00\x80\xe69\x00\x80\xd19\x00\x80'
instead of an array of floats. Does anyone know how to convert this output into an array of floats, or have a better way to do this?
After you've read the file content, you should be able to get an array using numpy.fromstring:
import gzip
import numpy as np
f=gzip.GzipFile('filename')
file_content = f.read()
array = np.fromstring(file_content, dtype='f4')
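Note that numpy.fromstring is deprecated for reading binary data in recent NumPy releases; numpy.frombuffer is the drop-in replacement (the resulting array is read-only because it shares memory with the bytes object, so call .copy() if you need to modify it):
array = np.frombuffer(file_content, dtype='f4')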
I have saved a large array of complex numbers using Python:
numpy.save(file_name, eval(variable_name))
That worked without any trouble. However, loading it back with
variable_name = numpy.load(file_name)
yields the following error:
ValueError: total size of new array must be unchanged
Using: Python 2.7.9 64-bit and the file is 1.19 GB large.
There is no problem with the size of your array; you likely didn't open the file in the right way. Try this:
with open(file_name, "rb") as file_:
    variable_name = np.load(file_)
Alternatively you can use pickle:
import pickle
# Saving:
data_file = open('filename.bi', 'wb')
pickle.dump(your_data, data_file)
data_file.close()
# Loading:
data_file = open('filename.bi', 'rb')
data = pickle.load(data_file)
data_file.close()
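For what it's worth, here is a minimal round-trip sanity check (a small stand-in array rather than the 1.19 GB file from the question) showing that numpy.save/numpy.load normally handles complex arrays without trouble:
import numpy as np
arr = (np.arange(6) + 1j * np.arange(6)).reshape(2, 3)  # small complex128 array
np.save("roundtrip.npy", arr)
loaded = np.load("roundtrip.npy")
assert np.array_equal(arr, loaded)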