I am trying to port this bit of MATLAB code to Python.
MATLAB:
function write_file(im,name)
    fp = fopen(name,'wb');
    M = size(im);
    fwrite(fp,[M(1) M(2) M(3)],'int');
    fwrite(fp,im(:),'float');
    fclose(fp);
where im is a 3D matrix. As far as I understand, the function writes a binary file with a header row containing the matrix size. The header is made of 3 integers. Then, im is written as a single column of floats. In MATLAB this takes a few seconds for a 150 MB file.
Python:
import struct
import numpy as np

def write_image(im, file_name):
    with open(file_name, 'wb') as f:
        l = im.shape[0]*im.shape[1]*im.shape[2]
        header = np.array([im.shape[0], im.shape[1], im.shape[2]])
        header_bin = struct.pack("I"*3, *header)
        f.write(header_bin)
        im_bin = struct.pack("f"*l, *np.reshape(im, (l,1), order='F'))
        f.write(im_bin)
where im is a numpy array. This code works correctly: I compared the output with the binary returned by MATLAB and they are identical. However, for the 150 MB file it takes several seconds and tends to drain all the memory (in the image linked I stopped the execution to avoid it, but you can see how it builds up!).
This does not make sense to me, as I am running the function on a PC with 15 GB of RAM. How can processing a 150 MB file require so much memory?
I'd be happy to use a different method, as long as it is still possible to have two formats for the header and the data column.
There is no need to use struct to save your array. numpy.ndarray has a convenience method for saving itself in binary mode: ndarray.tofile. The following should be much more efficient than creating a gigantic string with the same number of elements as your array:
def write_image(im, file_name):
    with open(file_name, 'wb') as f:
        # header: three 32-bit ints, matching the original struct.pack("I"*3, ...)
        np.array(im.shape, dtype=np.int32).tofile(f)
        im.T.tofile(f)
tofile always saves in row-major C order, while MATLAB uses column-major Fortran order. The simplest way to get around this is to save the transpose of the array. In general, ndarray.T should create a view (wrapper object pointing to the same underlying data) instead of a copy, so your memory usage should not increase noticeably from this operation.
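As a sanity check (not part of the original answer), a file written this way can be read back with np.fromfile. This sketch assumes im was stored as float32, matching the MATLAB 'float' format; the names are illustrative only:

import numpy as np

def read_image(file_name):
    with open(file_name, 'rb') as f:
        # header: three 32-bit ints giving the original array shape
        shape = np.fromfile(f, dtype=np.int32, count=3)
        # data: float32 values in column-major (Fortran) order
        data = np.fromfile(f, dtype=np.float32)
    return data.reshape(tuple(shape), order='F')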
I have thousands of CSV files that I would like to append to create one big numpy array. The problem is that the numpy array would be much bigger than my RAM. Is there a way of writing a bit at a time to disk, without having the entire array in RAM?
Also is there a way of reading only a specific part of the array from disk at a time?
When working with numpy and large arrays, there are several approaches depending on what you need to do with that data.
The simplest answer is to use less data. If your data has lots of repeating elements, it is often possible to use a sparse array from scipy because the two libraries are heavily integrated.
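For instance, here is a rough sketch of the sparse route (assuming the data really is mostly zeros; the sizes and values below are just for illustration):

import numpy as np
from scipy import sparse

# build the sparse matrix directly from coordinates, never materialising
# the full dense array in memory
rows = np.arange(0, 10000, 100)
cols = np.arange(0, 10000, 100)
vals = np.ones_like(rows, dtype=np.float64)
mat = sparse.coo_matrix((vals, (rows, cols)), shape=(10000, 10000)).tocsr()

print(mat.nnz)       # number of stored (non-zero) elements
print(mat[0, 0])     # indexing works much like a dense array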
Another answer (IMO: the correct solution to your problem) is to use a memory-mapped array. This lets numpy load only the necessary parts into RAM when needed, and leave the rest on disk. The files containing the data can be simple binary files created using any number of methods, but the built-in Python module that would handle this is struct. Appending more data would be as simple as opening the file in append mode and writing more bytes of data. Make sure that any references to the memory-mapped array are re-created any time more data is written to the file, so the information stays fresh.
Finally, there is compression. Numpy can compress arrays with savez_compressed, which can then be opened with numpy.load. Importantly, compressed numpy files cannot be memory-mapped and must be loaded into memory entirely. Loading one column at a time may be able to get you under the threshold, but this could similarly be applied to other methods to reduce memory usage. Numpy's built-in compression will only save disk space, not memory. There may exist other libraries that perform some sort of streaming compression, but that is beyond the scope of my answer.
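A minimal sketch of that compression route (file and array names are placeholders, not from the original answer):

import numpy as np

big = np.random.rand(1000, 1000)
np.savez_compressed('compressed_data.npz', data=big)

# loading decompresses into memory; .npz members are accessed by keyword name
with np.load('compressed_data.npz') as archive:
    restored = archive['data']
print(np.allclose(big, restored))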
Here is an example of putting binary data into a file then opening it as a memory-mapped array:
import numpy as np

# open a file for data of a single column
with open('column_data.dat', 'wb') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())

# open the array as a memory-mapped file
column_mmap = np.memmap('column_data.dat', dtype=np.float64)

# read some data
print(np.mean(column_mmap[0:1024]))

# write some data
column_mmap[0:512] = .5

# Deletion closes the memory-mapped file and flushes changes to disk.
# del isn't strictly needed, as Python will garbage-collect objects that are no
# longer accessible. If, for example, you intend to read the entire array,
# you will need to periodically make sure the array gets deleted and re-created,
# or the entire thing will end up in memory again. This could be done with a
# function that loads and operates on part of the array; when the function
# returns and the memory-mapped array local to the function goes out of scope,
# it will be garbage collected. Calling such a function would not cause a
# build-up of memory usage.
del column_mmap

# write some more data to the array (not while the mmap is open)
with open('column_data.dat', 'ab') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())
I want to save multiple large-sized numpy arrays to a numpy binary file to prevent my code from crashing, but it seems like it keeps getting overwritten when I add on an array. The last array saved is what is set to allarrays when save.npy is opened and read. Here is my code:
import numpy as np

with open('save.npy', 'wb') as f:
    for num in range(500):
        array = np.random.rand(100, 400)
        np.save(f, array)

with open('save.npy', 'rb') as f:
    allarrays = np.load(f)
If the file existed before, I want it to be overwritten if the code is rerun. That's why I chose 'wb' instead of 'ab'.
alist = []
with open('save.npy', 'rb') as f:
    alist.append(np.load(f))
When you load, you have to collect all the loads in a list or something similar. np.load only loads one array, starting at the current file position.
You can try memory mapping to disk.
import numpy as np

# merge arrays using a memory-mapped file
mm = np.memmap("mmap.bin", dtype='float32', mode='w+', shape=(500, 100, 400))
for num in range(500):
    mm[num, :, :] = np.random.rand(100, 400)

# save the final array to an npy file
with open('save.npy', 'wb') as f:
    np.save(f, mm[:])
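To confirm the merge, the resulting .npy file can be read back normally (a quick check, not part of the original answer):

merged = np.load('save.npy')
print(merged.shape)   # expected: (500, 100, 400)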
I ran into this problem as well, and solved it in a not-very-neat way, but perhaps it's useful for others. It's inspired by hpaulj's approach, which is incomplete (i.e., it doesn't load all the data). Perhaps this is not how one is supposed to solve this problem to begin with... but anyhow, read on.
I had saved my data using a similar procedure to the OP's:
# Saving the data in a for-loop
with open(savefilename, 'wb') as f:
    for datafilename in list_of_datafiles:
        # Do the processing
        data_to_save = ...
        np.save(f, data_to_save)
I then ran into the problem that calling np.load() only loaded a single saved array, and none of the rest. However, I knew that the data was in principle contained in the *.npy file, given that the file size was growing during the saving loop. What was required was simply to call the load command repeatedly on the open file. As I didn't know exactly how many arrays were contained in the file, I simply kept looping over the load until it failed. It's hacky, but it works.
# Loading the data in a for-loop
data_to_read = []
with open(savefilename, 'rb') as f:
    while True:
        try:
            data_to_read.append(np.load(f))
        except Exception:
            print("all data has been read!")
            break
Then you can call, e.g., len(data_to_read) to see how many of the arrays are contained in it. Calling, e.g., data_to_read[0] gives you the first saved array, etc.
I would like to benchmark NetCDF I/O performance compared to reading a plain binary file in Python and C.
For reading the binary file, I have defined different buffer sizes to read. Here is the code of the read function in Python:
import itertools

def parse_os_read(filename, buffersize):
    f = open(filename, 'rb')
    for loops in itertools.count():
        record = f.read(buffersize)
        if not record:
            break
    f.close()
    return loops
I can then call this function with different buffer sizes, ranging from 64 KB to several hundred MB.
I would now like to implement a comparable sequential-read benchmark for NetCDF. However, I could not identify a read function where I can set a pre-defined buffer size. I assume the NetCDF file should contain a 1D array of random floats as its data structure.
Any thoughts?
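One possible approach (an untested sketch, assuming the netCDF4 package and a 1D float variable named 'data' in the file) is to emulate the buffer size by slicing the variable in fixed-size chunks:

from netCDF4 import Dataset

def parse_netcdf_read(filename, chunk_elements):
    # read a 1D variable 'data' in chunks of chunk_elements values
    loops = 0
    with Dataset(filename, 'r') as ds:
        var = ds.variables['data']                     # assumed variable name
        n = var.shape[0]
        for start in range(0, n, chunk_elements):
            chunk = var[start:start + chunk_elements]  # reads only this slice from disk
            loops += 1
    return loops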
I am currently trying to read a Fortran file with Python using the following technique:
with open(myfile, "rb") as f:
    for i in range(0, n):
        s = struct.unpack('=f', f.read(4))
        mylist.append(s[0])
But it is very slow for large arrays. Is there a way to read the contents of the entire loop in one go and put them into mylist, to avoid converting and appending each item one by one?
Thank you very much.
This is what the array module is for:
import array

a = array.array('f')
a.fromfile(f, n)
Now you can use the array object like a normal sequence type. You can also convert it to a list if you need to with tolist.
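For completeness, a small end-to-end sketch (the file name and element count are placeholders), reading n floats in one call instead of one struct.unpack per value:

import array

n = 1000  # number of floats to read (placeholder)
with open("myfile.bin", "rb") as f:
    a = array.array('f')   # 4-byte floats, matching the struct format '=f'
    a.fromfile(f, n)       # reads n*4 bytes in a single call
mylist = a.tolist()        # plain Python list, if that's what you need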
I have a Python program that processes fairly large NumPy arrays (in the hundreds of megabytes), which are stored on disk in pickle files (one ~100MB array per file). When I want to run a query on the data I load the entire array, via pickle, and then perform the query (so that from the perspective of the Python program the entire array is in memory, even if the OS is swapping it out). I did this mainly because I believed that being able to use vectorized operations on NumPy arrays would be substantially faster than using for loops through each item.
I'm running this on a web server which has memory limits that I quickly run up against. I have many different kinds of queries that I run on the data so writing "chunking" code which loads portions of the data from separate pickle files, processes them, and then proceeds to the next chunk would likely add a lot of complexity. It'd definitely be preferable to make this "chunking" transparent to any function that processes these large arrays.
It seems like the ideal solution would be something like a generator which periodically loaded a block of the data from the disk and then passed the array values out one by one. This would substantially reduce the amount of memory required by the program without requiring any extra work on the part of the individual query functions. Is it possible to do something like this?
PyTables is a package for managing hierarchical datasets. It is designed to solve this problem for you.
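For example, a rough sketch using PyTables' extendable arrays (the file name, node name, and shapes here are illustrative, not from the original answer):

import numpy as np
import tables

# create an HDF5 file with an extendable array (the first dimension grows)
with tables.open_file('queries.h5', mode='w') as h5:
    earray = h5.create_earray(h5.root, 'data',
                              atom=tables.Float64Atom(),
                              shape=(0, 400))
    for _ in range(10):
        # append chunks without ever holding everything in RAM
        earray.append(np.random.rand(100, 400))

# later: read only the slice a query needs
with tables.open_file('queries.h5', mode='r') as h5:
    chunk = h5.root.data[0:100]   # loads just these rows from disk
    print(chunk.mean())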
NumPy's memory-mapped data structure (memmap) might be a good choice here.
You access your NumPy arrays from a binary file on disk, without loading the entire file into memory at once.
(Note: I believe, but am not certain, that NumPy's memmap object is not the same as Python's mmap -- in particular, NumPy's is array-like, while Python's is file-like.)
The method signature is:
A = NP.memmap(filename, dtype, mode, shape, order='C')
All of the arguments are straightforward (i.e., they have the same meaning as used elsewhere in NumPy) except for 'order', which refers to order of the ndarray memory layout. I believe the default is 'C', and the (only) other option is 'F', for Fortran--as elsewhere, these two options represent row-major and column-major order, respectively.
The two methods are:
flush (which writes to disk any changes you make to the array); and
close (which releases the memory map; in practice you can also just delete the memmap object, or let it go out of scope, once the data has been flushed).
example use:
import numpy as NP
from tempfile import mkdtemp
import os.path as PH

my_data = NP.random.randint(10, 100, 10000).reshape(1000, 10)
my_data = NP.array(my_data, dtype="float")

fname = PH.join(mkdtemp(), 'tempfile.dat')
mm_obj = NP.memmap(fname, dtype="float32", mode="w+", shape=(1000, 10))

# now write the data to the memmap array:
mm_obj[:] = my_data[:]

# reload the memmap:
mm_obj = NP.memmap(fname, dtype="float32", mode="r", shape=(1000, 10))

# verify that it's there!:
print(mm_obj[:20, :])
It seems like the ideal solution would be something like a generator which periodically loaded a block of the data from the disk and then passed the array values out one by one. This would substantially reduce the amount of memory required by the program without requiring any extra work on the part of the individual query functions. Is it possible to do something like this?
Yes, but not by keeping the arrays on disk in a single pickle -- the pickle protocol just isn't designed for "incremental deserialization".
You can write multiple pickles to the same open file, one after the other (use dump, not dumps), and then the "lazy evaluator for iteration" just needs to use pickle.load each time.
Example code (Python 3.1 -- in 2.x you'll want cPickle instead of pickle and a -1 for protocol, etc., of course;-):
>>> import pickle
>>> lol = [range(i) for i in range(5)]
>>> fp = open('/tmp/bah.dat', 'wb')
>>> for subl in lol: pickle.dump(subl, fp)
...
>>> fp.close()
>>> fp = open('/tmp/bah.dat', 'rb')
>>> def lazy(fp):
...     while True:
...         try: yield pickle.load(fp)
...         except EOFError: break
...
>>> list(lazy(fp))
[range(0, 0), range(0, 1), range(0, 2), range(0, 3), range(0, 4)]
>>> fp.close()