Saving multiple Numpy arrays to a Numpy binary file (Python)

I want to save multiple large-sized numpy arrays to a numpy binary file to prevent my code from crashing, but it seems like it keeps getting overwritten when I add on an array. The last array saved is what is set to allarrays when save.npy is opened and read. Here is my code:
import numpy as np
with open('save.npy', 'wb') as f:
    for num in range(500):
        array = np.random.rand(100, 400)
        np.save(f, array)
with open('save.npy', 'rb') as f:
    allarrays = np.load(f)
If the file existed before, I want it to be overwritten if the code is rerun. That's why I chose 'wb' instead of 'ab'.

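alist = []
with open('save.npy', 'rb') as f:
    # each np.load call reads the next array; repeat it once per saved array
    alist.append(np.load(f))
When you load, you have to collect all the loads in a list or something similar. np.load only loads one array per call, starting at the current file position.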
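A minimal sketch of the full round trip, assuming the number of saved arrays is known in advance. The file name `multi.npy`, the array shape, and the count of 5 are illustrative and not taken from the original post. Each `np.save` call on the open handle appends one array, and each `np.load` call reads the next one:

```python
import os
import tempfile
import numpy as np

n_arrays = 5  # illustrative count
path = os.path.join(tempfile.mkdtemp(), "multi.npy")  # illustrative file name

# Write several arrays back to back through one open file handle.
saved = [np.random.rand(3, 4) for _ in range(n_arrays)]
with open(path, "wb") as f:
    for arr in saved:
        np.save(f, arr)

# Read them back: every np.load call continues from the current file position.
with open(path, "rb") as f:
    loaded = [np.load(f) for _ in range(n_arrays)]

print(len(loaded), loaded[0].shape)
```

A file written this way is a concatenation of several `.npy` records, so a plain `np.load('multi.npy')` by name would still return only the first array.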

You can try memory mapping to disk.
# merge arrays using a memory-mapped file
mm = np.memmap("mmap.bin", dtype='float32', mode='w+', shape=(500, 100, 400))
for num in range(500):
    mm[num] = np.random.rand(100, 400)  # fill only slice num
# save final array to npy file
with open('save.npy', 'wb') as f:
    np.save(f, mm)
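If the intermediate raw file is not needed, `np.lib.format.open_memmap` can create the `.npy` file itself as a memory map, so the data is written in place without a second copy. The following is only a sketch under that assumption, with a deliberately small shape and an illustrative temporary path:

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "save.npy")  # illustrative path
shape = (5, 10, 40)  # small stand-in for (500, 100, 400)

# Create a .npy file on disk and map it into memory for writing.
mm = np.lib.format.open_memmap(path, mode="w+", dtype=np.float32, shape=shape)
for num in range(shape[0]):
    mm[num] = np.random.rand(*shape[1:])  # fill one slice at a time
mm.flush()  # push pending changes to disk
del mm      # drop the mapping

# The result is a regular .npy file; mmap_mode="r" avoids loading it all at once.
allarrays = np.load(path, mmap_mode="r")
print(allarrays.shape, allarrays.dtype)
```

The total shape has to be known up front, which is the same requirement the `np.memmap` version has.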

I ran into this problem as well, and solved it in not a very neat way, but perhaps it's useful for others. It's inspired by hpaulj's approach, which is incomplete (i.e., doesn't load the data). Perhaps this is not how one is supposed to solve this problem to begin with...but anyhow, read on.
I had saved my data using a similar procedure as the OP,
# Saving the data in a for-loop
with open(savefilename, 'wb') as f:
    for datafilename in list_of_datafiles:
        # Do the processing
        data_to_save = ...
        # write to the open handle so each array is appended to the file
        np.save(f, data_to_save)
I then ran into the problem that calling np.load() returned only a single array and none of the rest. However, I knew that the data was in principle contained in the *.npy file, given that the file size was growing during the saving loop. What was required was simply to loop over the content of the file while calling the load command repeatedly. As I didn't know how many arrays were contained in the file, I simply repeated the load until it failed. It's hacky, but it works.
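# Loading the data in a for-loop
data_to_read = []
with open(savefilename, 'rb') as f:  # binary mode is required for np.load
    while True:
        try:
            data_to_read.append(np.load(f))
        except EOFError:
            # recent NumPy versions raise EOFError when no data is left;
            # older versions may raise a different exception here
            print("all data has been read!")
            break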
Then you can call, e.g., len(data_to_read) to see how many of the arrays are contained in it. Calling, e.g., data_to_read[0] gives you the first saved array, etc.
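A variant that avoids relying on an exception to end the loop is to compare the current file position with the file size. This is a sketch rather than the approach used above; the file name and the four small arrays are made up for illustration:

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "stream.npy")  # illustrative path

# Save an "unknown" number of arrays through one file handle.
with open(path, "wb") as f:
    for i in range(4):
        np.save(f, np.full((2, 3), i))

data_to_read = []
file_size = os.path.getsize(path)
with open(path, "rb") as f:
    # Keep loading until the file position reaches the end of the file.
    while f.tell() < file_size:
        data_to_read.append(np.load(f))

print(len(data_to_read))
```

Because the loop condition is explicit, a genuinely corrupt record still raises an error instead of being silently treated as the end of the data.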

Related

Saving deque to file minding performance and portability

I have a while loop that collects data from a microphone (replaced here by np.random.random() to make it reproducible). I do some operations on it; let's say I take the abs().mean() here, because my output will be a one-dimensional array.
This loop is going to run for a LONG time (e.g., once a second for a week), and I am wondering about my options for saving the data. My main concerns are saving the data with acceptable performance and keeping the result portable (e.g., .csv beats .npy).
The simple way: just append things into a .txt file. Could be replaced by csv.gz maybe? Maybe using np.savetxt()? Would it be worth it?
The hdf5 way: this should be a nicer way, but reading the whole dataset to append to it doesn't seem like good practice or better performing than dumping into a text file. Is there another way to append to hdf5 files?
The npy way (code not shown): I could save this into a .npy file but I would rather make it portable using a format that could be read from any program.
from collections import deque
import numpy as np
import h5py
save_interval = 60  # assumed value; the original does not define it
amplitudes = deque(maxlen=save_interval)
# Read from the microphone in a continuous stream
while True:
    data = np.random.random(100)
    amplitude = np.abs(data).mean()
    print(amplitude, end="\r")
    amplitudes.append(amplitude)
    # Save the amplitudes to a file every n iterations
    if len(amplitudes) == save_interval:
        with open("amplitudes.txt", "a") as f:
            for amp in amplitudes:
                f.write(str(amp) + "\n")
        amplitudes.clear()
    # Save the amplitudes to an HDF5 file every n iterations
    # (alternative to the text-file branch above; use one or the other,
    # since the deque is cleared after each save)
    if len(amplitudes) == save_interval:
        # Convert the deque to a Numpy array
        amplitudes_array = np.array(amplitudes)
        # Open an HDF5 file
        with h5py.File("amplitudes.h5", "a") as f:
            # Get the existing dataset or create a new one if it doesn't exist
            dset = f.get("amplitudes")
            if dset is None:
                dset = f.create_dataset("amplitudes", data=amplitudes_array, dtype=np.float32,
                                        maxshape=(None,), chunks=True, compression="gzip")
            else:
                # Get the current size of the dataset
                current_size = dset.shape[0]
                # Resize the dataset to make room for the new data
                dset.resize((current_size + save_interval,))
                # Write the new data to the dataset
                dset[current_size:] = amplitudes_array
        # Clear the deque
        amplitudes.clear()
    # For debug only
    if len(amplitudes) > 3:
        break
Update
I get that the answer might depend a bit on the sampling frequency (once a second might be too slow) and the data dimensions (single column might be too little). I guess I asked because anything can work, but I always just dump to text. I am not sure where the breaking points are that tip the decision into one or the other method.
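For the plain-text option, one possible sketch is to pass an open file handle in append mode to `np.savetxt`, so each block is added without re-reading earlier data. The buffer size, the 12-step loop, and the file path below are illustrative assumptions, not part of the question:

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "amplitudes.txt")  # illustrative path
save_interval = 5  # illustrative: flush every 5 samples

buffer = []
for step in range(12):  # stand-in for the endless acquisition loop
    buffer.append(np.abs(np.random.random(100)).mean())
    if len(buffer) == save_interval:
        # Append the block; earlier rows in the file are never re-read.
        with open(path, "a") as f:
            np.savetxt(f, np.array(buffer))
        buffer.clear()

stored = np.loadtxt(path)
print(stored.shape)
```

Samples still in the buffer when the loop ends (two in this run) are not on disk yet, so a real acquisition loop would need a final flush on shutdown.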

How can I create a Numpy Array that is much bigger than my RAM from 1000s of CSV files?

I have 1000s of CSV files that I would like to append and create one big numpy array. The problem is that the numpy array would be much bigger than my RAM. Is there a way of writing a bit at a time to disk without having the entire array in RAM?
Also is there a way of reading only a specific part of the array from disk at a time?
When working with numpy and large arrays, there are several approaches depending on what you need to do with that data.
The simplest answer is to use less data. If your data has lots of repeating elements, it is often possible to use a sparse array from scipy because the two libraries are heavily integrated.
Another answer (IMO: the correct solution to your problem) is to use a memory mapped array. This will let numpy only load the necessary parts to ram when needed, and leave the rest on disk. The files containing the data can be simple binary files created using any number of methods, but the built-in python module that would handle this is struct. Appending more data would be as simple as opening the file in append mode, and writing more bytes of data. Make sure that any references to the memory mapped array are re-created any time more data is written to the file so the information is fresh.
A final option is compression. NumPy can compress arrays with savez_compressed, and the resulting file can be opened with numpy.load. Importantly, compressed NumPy files cannot be memory-mapped and must be loaded into memory entirely. Loading one column at a time may get you under the threshold, but the same trick can be applied to the other methods as well. NumPy's built-in compression only saves disk space, not memory. Other libraries may offer some form of streaming compression, but that is beyond the scope of my answer.
Here is an example of putting binary data into a file then opening it as a memory-mapped array:
import numpy as np
# open a file for data of a single column
with open('column_data.dat', 'wb') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        # np.float was removed from NumPy; np.float64 is the explicit equivalent
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())
# open the array as a memory-mapped file
column_mmap = np.memmap('column_data.dat', dtype=np.float64)
# read some data
print(np.mean(column_mmap[0:1024]))
# write some data
column_mmap[0:512] = .5
# deletion closes the memory-mapped file and flushes changes to disk.
# del isn't specifically needed as python will garbage collect objects no
# longer accessible. If for example you intend to read the entire array,
# you will need to periodically make sure the array gets deleted and re-created
# or the entire thing will end up in memory again. This could be done with a
# function that loads and operates on part of the array, then when the function
# returns and the memory-mapped array local to the function goes out of scope,
# it will be garbage collected. Calling such a function would not cause a
# build-up of memory usage.
del column_mmap
# write some more data to the array (not while the mmap is open)
with open('column_data.dat', 'ab') as f:
    # for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64)  # represents one column of data
        f.write(csv_data.tobytes())
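The idea from the comments above, a function that maps the file, works on one part at a time, and releases the mapping when it returns, could look roughly like this. The helper name `chunked_mean`, the chunk size, and the test data are hypothetical:

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "column_data.dat")  # illustrative path
n_values = 10_000
np.arange(n_values, dtype=np.float64).tofile(path)  # raw float64 values, no header

def chunked_mean(file_name, chunk_size=1024):
    """Mean of a raw float64 file, reading one chunk at a time (hypothetical helper)."""
    mm = np.memmap(file_name, dtype=np.float64, mode="r")
    total = 0.0
    for start in range(0, mm.shape[0], chunk_size):
        # Only this slice needs to be paged in from disk.
        total += float(mm[start:start + chunk_size].sum())
    count = mm.shape[0]
    del mm  # release the mapping before returning
    return total / count

result = chunked_mean(path)
print(result)
```

The same pattern works for any reduction that can be accumulated chunk by chunk, such as sums, minima, or histograms.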

How to read more efficiently in Python?

I am trying to read a file with 27,000+ lines, and it takes far too long: 30 minutes or more. To clarify, I am running this in a Coursera-hosted Jupyter notebook, so I don't think it is a limitation of my own system.
import numpy as np
def get_data(filename):  # function name assumed; the original shows only the body
    with open(filename) as training_file:
        # Your code starts here
        file = training_file.read()
        lines = file.split('\n')
        images = []
        labels = []
        images = np.array(images)
        labels = np.array(labels)
        c = 0
        for line in lines[1:]:
            row = line.split(',')
            labels = np.append(labels, row[0])
            images = np.append(images, np.array_split(row[1:], 28))
            c += 1
            print(c)
        images = images.astype(np.float64)
        labels = labels.astype(np.float64)
        # Your code ends here
    return images, labels
Use the built-in numpy functions for reading a CSV (loadtxt, genfromtxt, etc.) rather than rolling your own; they parse the whole file in one call and are much faster than building the arrays row by row in Python.
Are you sure that it takes too much time because of file reading? Comment out numpy code and run only file read part. In my opinion, numpy.append is the slowest part. Have a look at this: NumPy append vs Python append
You can save on memory by reading the file line by line with a for loop:
with open("filename") as f:
    for line in f:
        <your code>
But as mentioned in other comments, there are CSV tools you can use that will be way faster: see csv or numpy.
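As a sketch of the NumPy route: the layout assumed here (one header row, then a label followed by 784 pixel values per line) is inferred from the `np.array_split(row[1:], 28)` call in the question, and the tiny generated file is only a stand-in for the real data:

```python
import os
import tempfile
import numpy as np

# Build a tiny stand-in CSV: a header row, then label + 784 pixel values per line.
path = os.path.join(tempfile.mkdtemp(), "train.csv")  # illustrative path
n_rows = 3
with open(path, "w") as f:
    f.write(",".join(["label"] + [f"pixel{i}" for i in range(784)]) + "\n")
    for label in range(n_rows):
        f.write(",".join([str(label)] + ["0"] * 784) + "\n")

# Parse everything in one call, then split and reshape with array slicing.
data = np.loadtxt(path, delimiter=",", skiprows=1)
labels = data[:, 0]
images = data[:, 1:].reshape(-1, 28, 28)
print(images.shape, labels.shape)
```

This avoids `np.append` inside the loop, which copies the whole array on every call and is the likely cause of the slowdown.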

python struct.pack and write vs matlab fwrite

I am trying to port this bit of matlab code to python
matlab
function write_file(im,name)
    fp = fopen(name,'wb');
    M = size(im);
    fwrite(fp,[M(1) M(2) M(3)],'int');
    fwrite(fp,im(:),'float');
    fclose(fp);
where im is a 3D matrix. As far as I understand, the function first writes a binary file with a header containing the matrix size. The header is made of 3 integers. Then, im is written as a single column of floats. In matlab this takes a few seconds for a file of 150MB.
python
import struct
import numpy as np
def write_image(im, file_name):
    with open(file_name, 'wb') as f:
        l = im.shape[0]*im.shape[1]*im.shape[2]
        header = np.array([im.shape[0], im.shape[1], im.shape[2]])
        header_bin = struct.pack("I"*3, *header)
        f.write(header_bin)
        im_bin = struct.pack("f"*l, *np.reshape(im, (l,1), order='F'))
        f.write(im_bin)
where im is a numpy array. This code works well as I compared with the binary returned by matlab and they are the same. However, for the 150MB file, it takes several seconds and tends to drain all the memory (in the image linked I stopped the execution to avoid it, but you can see how it builds up!).
This does not make sense to me, as I am running the function on a PC with 15GB of RAM. How come processing a 150MB file requires so much memory?
I'd be happy to use a different method, as long as it is possible to have two formats for the header and the data column.
There is no need to use struct to save your array. numpy.ndarray has a convenience method for saving itself in binary mode: ndarray.tofile. The following should be much more efficient than creating a gigantic string with the same number of elements as your array:
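def write_image(im, file_name):
    with open(file_name, 'wb') as f:
        # 4-byte integers, matching the 'int' header written by MATLAB
        np.array(im.shape, dtype=np.int32).tofile(f)
        # 4-byte floats, matching 'float'; no copy is made if im is already float32
        im.T.astype(np.float32, copy=False).tofile(f)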
tofile always saves in row-major C order, while MATLAB uses column-major Fortran order. The simplest way to get around this is to save the transpose of the array. In general, ndarray.T should create a view (wrapper object pointing to the same underlying data) instead of a copy, so your memory usage should not increase noticeably from this operation.
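To check the byte layout, a small round trip can help. This sketch assumes a header of three 4-byte integers followed by 4-byte floats in column-major order, as in the MATLAB code; `read_image` is a hypothetical helper that is not part of the original answer:

```python
import os
import tempfile
import numpy as np

def write_image(im, file_name):
    with open(file_name, "wb") as f:
        np.array(im.shape, dtype=np.int32).tofile(f)    # 3 x int32 header
        im.T.astype(np.float32, copy=False).tofile(f)   # data in column-major order

def read_image(file_name):
    """Hypothetical reader for the layout written above."""
    with open(file_name, "rb") as f:
        shape = np.fromfile(f, dtype=np.int32, count=3)
        data = np.fromfile(f, dtype=np.float32)
    # order="F" undoes the transpose that was applied when writing.
    return data.reshape(tuple(shape), order="F")

im = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
path = os.path.join(tempfile.mkdtemp(), "image.bin")  # illustrative path
write_image(im, path)
restored = read_image(path)
print(np.array_equal(restored, im), os.path.getsize(path))
```

The file size is 12 header bytes plus 4 bytes per element, which gives a quick sanity check against the file produced by MATLAB.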

OSError 24 (Too many open files) when reading bunch of FITS with astropy.io

I'm trying to load about 2,000 FITS files into memory using astropy.io.fits:
from astropy.io import fits
def readfits(filename):
    with fits.open(filename) as ft:
        # the fits contain a single HDU
        data = ft[0].data
    return data
data_sci = []
for i in range(2000):
    data_sci.append(readfits("filename_{}.fits".format(i)))
However, when reaching the 1015th file, OSError: [Errno 24] Too many open files is raised.
I have the same issue with:
def readfits(filename):
    ft = fits.open(filename)
    data = ft[0].data
    ft.close()
    return data
I suspect that astropy.io.fits does not properly close the file. Is there a way I can force the files to be closed?
Your readfits function actually leaves the file handle open in order to keep access to the data, because by default it creates a mmap to the data and does not read it entirely into physical memory, as explained: http://astropy.readthedocs.org/en/latest/io/fits/appendix/faq.html#i-m-opening-many-fits-files-in-a-loop-and-getting-oserror-too-many-open-files
Incidentally, if you just want a function that reads the data out of the first HDU this is already built in: http://docs.astropy.org/en/v1.0.5/io/fits/api/files.html#astropy.io.fits.getdata
It's not necessary to reinvent the wheel.
After taking a look at the astropy documentation i found this: http://astropy.readthedocs.org/en/latest/io/fits/appendix/faq.html#i-m-opening-many-fits-files-in-a-loop-and-getting-oserror-too-many-open-files
You can call this function and store its output as long as you have memory. I thought it worth mentioning the answer explicitly, but the credit goes to Iguananaut, bkaf, and this page.
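import gc
from astropy.io import fits
def get_single_fits_data(fits_dir):
    hdul = fits.open(fits_dir)
    for hdu in hdul:
        # copy the data so it no longer references the memory-mapped file
        image_data = hdu.data.copy()
    hdul.close()
    gc.collect()
    return image_data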
