How to save ragged tensor as a file? - python

How can I save a ragged tensor as a file on my disk and then reuse it in calculations opening it from the disk? Tensor consists of a nested array of numbers with 4 signs after the point. (I'm working in a Google Colab and using Google disk to save my files, I know only Python a little bit).
Here is my data:
I take this column "sim_fasttex" which is a list of lists of different length, reshape each of them according to "h" and "w" and collect all these matrices in one list, so finally it's going to be a ragged tensor of the shape (number of rows in initial table, variable length of a matrix, variable heigth of a matrix)

I don't know your context but,
You can save any object to a file using the pickle module. Like this
import pickle
the_object = object
with open("a_file_name.pkl", "wb") as f:
pickle.dump(the_object, f)
And later you can load that same object:
import pickle
with open("a_file_name.pkl", "rb") as f:
the_object = pickle.load(f)

Related

How can I create a Numpy Array that is much bigger than my RAM from 1000s of CSV files?

I have 1000s of CSV files that I would like to append and create one big numpy array. The problem is that the numpy array would be much bigger than my RAM. Is there a way of writing a bit at a time to disk without having the entire array in RAM?
Also is there a way of reading only a specific part of the array from disk at a time?
When working with numpy and large arrays, there are several approaches depending on what you need to do with that data.
The simplest answer is to use less data. If your data has lots of repeating elements, it is often possible to use a sparse array from scipy because the two libraries are heavily integrated.
Another answer (IMO: the correct solution to your problem) is to use a memory mapped array. This will let numpy only load the necessary parts to ram when needed, and leave the rest on disk. The files containing the data can be simple binary files created using any number of methods, but the built-in python module that would handle this is struct. Appending more data would be as simple as opening the file in append mode, and writing more bytes of data. Make sure that any references to the memory mapped array are re-created any time more data is written to the file so the information is fresh.
Finally is something like compression. Numpy can compress arrays with savez_compressed which can then be opened with numpy.load. Importantly, compressed numpy files cannot be memory-mapped, and must be loaded into memory entirely. Loading one column at a time may be able to get you under the threshold, but this could similarly be applied to other methods to reduce memory usage. Numpy's built in compression techniques will only save disk space not memory. There may exist other libraries that perform some sorts of streaming compression, but that is beyond the scope of my answer.
Here is an example of putting binary data into a file then opening it as a memory-mapped array:
import numpy as np
#open a file for data of a single column
with open('column_data.dat', 'wb') as f:
#for 1024 "csv files"
for _ in range(1024):
csv_data = np.random.rand(1024).astype(np.float) #represents one column of data
f.write(csv_data.tobytes())
#open the array as a memory-mapped file
column_mmap = np.memmap('column_data.dat', dtype=np.float)
#read some data
print(np.mean(column_mmap[0:1024]))
#write some data
column_mmap[0:512] = .5
#deletion closes the memory-mapped file and flush changes to disk.
# del isn't specifically needed as python will garbage collect objects no
# longer accessable. If for example you intend to read the entire array,
# you will need to periodically make sure the array gets deleted and re-created
# or the entire thing will end up in memory again. This could be done with a
# function that loads and operates on part of the array, then when the function
# returns and the memory-mapped array local to the function goes out of scope,
# it will be garbage collected. Calling such a function would not cause a
# build-up of memory usage.
del column_mmap
#write some more data to the array (not while the mmap is open)
with open('column_data.dat', 'ab') as f:
#for 1024 "csv files"
for _ in range(1024):
csv_data = np.random.rand(1024).astype(np.float) #represents one column of data
f.write(csv_data.tobytes())

Reading Large HDF5 Files

I am new to using HDF5 files and I am trying to read files with shapes of (20670, 224, 224, 3). Whenever I try to store the results from the hdf5 into a list or another data structure, it takes either takes so long that I abort the execution or it crashes my computer. I need to be able to read 3 sets of hdf5 files, use their data, manipulate it, use it to train a CNN model and make predictions.
Any help for reading and using these large HDF5 files would be greatly appreciated.
Currently this is how I am reading the hdf5 file:
db = h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5")
training_db = list(db['data'])
Crashes probably mean you are running out of memory. Like Vignesh Pillay suggested, I would try chunking the data and work on a small piece of it at a time. If you are using the pandas method read_hdf you can use the iterator and chunksize parameters to control the chunking:
import pandas as pd
data_iter = pd.read_hdf('/tmp/test.hdf', key='test_key', iterator=True, chunksize=100)
for chunk in data_iter:
#train cnn on chunk here
print(chunk.shape)
Note this requires the hdf to be in table format
My answer updated 2020-08-03 to reflect code you added to your question.
As #Tober noted, you are running out of memory. Reading a dataset of shape (20670, 224, 224, 3) will become a list of 3.1G entities. If you read 3 image sets, it will require even more RAM.
I assume this is image data (maybe 20670 images of shape (224, 224, 3) )?
If so, you can read the data in slices with both h5py and tables (Pytables).
This will return the data as a NumPy array, which you can use directly (no need to manipulate into a different data structure).
Basic process would look like this:
with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5",'r') as db:
training_db = db['data']
# loop to get images 1 by 1
for icnt in range(20670) :
image_arr = training_db [icnt,:,:,:}
# then do something with the image
You could also read multiple images by setting the first index to a range (say icnt:icnt+100) then handle looping appropriately.
Your problem is arising as you are running out of memory. So, Virtual Datasets come in handy while dealing with large datasets like yours. Virtual datasets allow a number of real datasets to be mapped together into a single, sliceable dataset via an interface layer. You can read more about them here https://docs.h5py.org/en/stable/vds.html
I would recommend you to start from one file at a time. Firstly, create a Virtual Dataset file of your existing data like
with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
data_shape = db['data'].shape
layout = h5py.VirtualLayout(shape = (data_shape), dtype = np.uint8)
vsource = h5py.VirtualSource(db['data'])
with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'w', libver = 'latest') as file:
file.create_virtual_dataset('data', layout = layout, fillvalue = 0)
This will create a virtual dataset of your existing training data. Now, if you want to manipulate your data, you should open your file in r+ mode like
with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'r+', libver = 'latest') as file:
# Do whatever manipulation you want to do here
One more thing I would like to advise is make sure your indices while slicing are of int datatype, otherwise you will get an error.

Saving multiple Numpy arrays to a Numpy binary file (Python)

I want to save multiple large-sized numpy arrays to a numpy binary file to prevent my code from crashing, but it seems like it keeps getting overwritten when I add on an array. The last array saved is what is set to allarrays when save.npy is opened and read. Here is my code:
with open('save.npy', 'wb') as f:
for num in range(500):
array = np.random.rand(100,400)
np.save(f, array)
with open('save.npy', 'rb') as f:
allarrays = np.load(f)
If the file existed before, I want it to be overwritten if the code is rerun. That's why I chose 'wb' instead of 'ab'.
alist =[]
with open('save.npy', 'rb') as f:
alist.append(np.load(f))
When you load you have collect all loads in a list or something. load only loads one array, starting at the current file position.
You can try memory mapping to disk.
# merge arrays using memory mapped file
mm = np.memmap("mmap.bin", dtype='float32', mode='w+', shape=(500,100,400))
for num in range(500):
mm[num::] = np.random.rand(100,400)
# save final array to npy file
with open('save.npy', 'wb') as f:
np.save(f, mm[::])
I ran into this problem as well, and solved it in not a very neat way, but perhaps it's useful for others. It's inspired by hpaulj's approach, which is incomplete (i.e., doesn't load the data). Perhaps this is not how one is supposed to solve this problem to begin with...but anyhow, read on.
I had saved my data using a similar procedure as the OP,
# Saving the data in a for-loop
with open(savefilename, 'wb') as f:
for datafilename in list_of_datafiles:
# Do the processing
data_to_save = ...
np.save( savefilename, data_to_save )
And ran into the problem that calling np.load() only loaded the last saved array, none of the rest. However, I knew that the data was in principle contained in the *.npy file, given the file size was growing during the saving loop. What was required was to simply loop over the content of the numpy array while calling the load command repeatedly. As I didn't quite know how many files were contained in the file, I simply looped over the loading loop until it failed. It's hacky, but it works.
# Loading the data in a for-loop
data_to_read = []
with open(savefilename, 'r') as f:
while True:
try:
data_to_read.append( np.load(f) )
except:
print("all data has been read!")
break
Then you can call, e.g., len(data_to_read) to see how many of the arrays are contained in it. Calling, e.g., data_to_read[0] gives you the first saved array, etc.

python struct.pack and write vs matlab fwrite

I am trying to port this bit of matlab code to python
matlab
function write_file(im,name)
fp = fopen(name,'wb');
M = size(im);
fwrite(fp,[M(1) M(2) M(3)],'int');
fwrite(fp,im(:),'float');
fclose(fp);
where im is a 3D matrix. As far as I understand, the function first writes a binary file with a header row containing the matrix size. The header is made of 3 integers. Then, the im is written as a single column of floats. In matlab this takes few seconds for a file of 150MB.
python
import struct
import numpy as np
def write_image(im, file_name):
with open(file_name, 'wb') as f:
l = im.shape[0]*im.shape[1]*im.shape[2]
header = np.array([im.shape[0], im.shape[1], im.shape[2]])
header_bin = struct.pack("I"*3, *header)
f.write(header_bin)
im_bin = struct.pack("f"*l,*np.reshape(im, (l,1), order='F'))
f.write(im_bin)
f.close()
where im is a numpy array. This code works well as I compared with the binary returned by matlab and they are the same. However, for the 150MB file, it takes several seconds and tends to drain all the memory (in the image linked I stopped the execution to avoid it, but you can see how it builds up!).
This does not make sense to me as I am running the function on a 15GB of RAM PC. How come a 150MB file processing requires so much memory?
I'd happy to use a different method, as far as it is possible to have two formats for the header and the data column.
There is no need to use struct to save your array. numpy.ndarray has a convenience method for saving itself in binary mode: ndarray.tofile. The following should be much more efficient than creating a gigantic string with the same number of elements as your array:
def write_image(im, file_name):
with open(file_name, 'wb') as f:
np.array(im.shape).tofile(f)
im.T.tofile(f)
tofile always saves in row-major C order, while MATLAB uses column-major Fortran order. The simplest way to get around this is to save the transpose of the array. In general, ndarray.T should create a view (wrapper object pointing to the same underlying data) instead of a copy, so your memory usage should not increase noticeably from this operation.

Read binary file into Python 2D array

I have a large binary file that I want to read as a 48x1414339 array. I read it in this way:
f = open(fname, 'rb')
s = f.read()
import array
a = array.array('f',s)
But this gives me a 1D string. Is there a way to keep the columns distinct?
Wrap it in a class and implement e.g. __getitem__() to convert an index pair to the linear index. Using separate arrays is probably just adding overhead unless you plan to use rows separately.

Categories