assemble numpy array from different hdf5 datasets in one file - python

So I have this large hdf5 file that contains multiple large datasets. I am accessing it with h5py and want to read parts of each of those datasets into a common ndarray. Unfortunately, slicing across datasets is not supported, so I was wondering what the most efficient way is to assemble the ndarray from those datasets, given the circumstances?
Currently, I am using something along the following lines:
import concurrent.futures
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(assemble_array, range(NrDatasets))
where the function assemble_array reads the data into a predefined ndarray buffer of appropriate size, but this is not fast enough :/
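For illustration, a simplified sketch of what assemble_array does (the file name, dataset names, and shapes here are just placeholders):

import h5py
import numpy as np

f = h5py.File("myfile.h5", "r")
NrDatasets = 4
rows, cols = 1000, 300   # slice height per dataset, dataset width
buffer = np.empty((NrDatasets * rows, cols), dtype=np.float64)

def assemble_array(i):
    # read a slice of dataset i directly into its block of the shared buffer
    f["data_%d" % i].read_direct(
        buffer,
        source_sel=np.s_[:rows, :],
        dest_sel=np.s_[i * rows:(i + 1) * rows, :])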
Can anyone help?

Related

Faster pytorch dataset file

I have the following problem: I have many files of 3D volumes that I open to extract a bunch of numpy arrays.
I want to get those arrays randomly, i.e. in the worst case I open as many 3D volumes as there are numpy arrays I want to get, if all those arrays are in separate files.
The IO here isn't great: I open a big file only to get a small numpy array from it.
Any idea how I can store all these arrays so that the IO is better?
I can't pre-read all the arrays and save them all in one file, because then that file would be too big to fit in RAM.
I looked up LMDB but it all seems to be about Caffe.
Any idea how I can achieve this?
I iterated through my dataset, created an hdf5 file and stored the elements in the hdf5 file. Turns out that when the hdf5 file is opened, it doesn't load all the data into RAM, it loads the header instead.
The header is then used to fetch the data on request; that's how I solved my problem.
Reference:
http://www.machinelearninguru.com/deep_learning/data_preparation/hdf5/hdf5.html
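A minimal sketch of that behaviour with h5py (file and dataset names are made up for the example):

import h5py
import numpy as np

# one-time step: store every array under its own key in a single hdf5 file
with h5py.File("volumes.h5", "w") as f:
    for i in range(10):
        f.create_dataset("volume_%d" % i, data=np.random.rand(64, 64, 64))

# opening the file only reads metadata; data is fetched per request
with h5py.File("volumes.h5", "r") as f:
    crop = f["volume_3"][10:20, 10:20, 10:20]   # only this slice is read from disk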
One trivial solution is to pre-process your dataset and save multiple smaller crops of the original 3D volumes separately. This way you sacrifice some disk space for more efficient IO.
Note that you can make a trade-off with the crop size here: saving bigger crops than you need for input allows you to still do random crop augmentation on the fly. If you save overlapping crops in the pre-processing step, then you can ensure that all possible random crops of the original dataset can still be produced.
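A rough sketch of that preprocessing step, assuming the volumes are already available as numpy arrays (crop size and stride here are arbitrary):

import numpy as np

def save_crops(volume, name, crop=64, stride=48):
    # stride < crop produces overlapping crops, so every random crop of the
    # original volume can still be reproduced at training time
    d, h, w = volume.shape
    for z in range(0, d - crop + 1, stride):
        for y in range(0, h - crop + 1, stride):
            for x in range(0, w - crop + 1, stride):
                np.save("%s_%d_%d_%d.npy" % (name, z, y, x),
                        volume[z:z + crop, y:y + crop, x:x + crop])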
Alternatively, you may try using a custom data loader that retains the full volumes for a few batches. Be careful: this might create some correlation between batches. Since many machine learning algorithms rely on i.i.d. samples (e.g. Stochastic Gradient Descent), correlated batches can easily cause some serious mess.

how can I save super large array into many small files?

In a Linux 64-bit environment, I have very big float64 arrays (a single one will be 500GB to 1TB). I would like to access these arrays in numpy in a uniform way: a[x:y]. So I do not want to access the array as segments, file by file. Are there any tools with which I can create a memmap over many different files? Can hdf5 or pytables store a single CArray across many small files? Maybe something similar to fileinput? Or can I do something with the file system to simulate a single file?
In MATLAB I've been using H5P.set_external to do this. Then I can create a raw dataset and access it as one big raw file. But I do not know if I can create a numpy.ndarray over such a dataset in Python. Or can I spread a single dataset over many small hdf5 files?
Unfortunately, H5P.set_chunk does not work with H5P.set_external, because set_external only works with contiguous datasets, not chunked ones.
some related topics:
Chain datasets from multiple HDF5 files/datasets
I would use hdf5. In h5py, you can specify a chunk size, which makes retrieving small pieces of the array efficient:
http://docs.h5py.org/en/latest/high/dataset.html?#chunked-storage
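For example (file name, shape, and chunk size below are only illustrative):

import h5py
import numpy as np

with h5py.File("big_array.h5", "w") as f:
    # chunked (and optionally compressed) dataset; it can be filled piece by piece
    dset = f.create_dataset("a", shape=(10**8, 100), dtype="float64",
                            chunks=(10**4, 100), compression="gzip")
    dset[:10**4, :] = np.random.rand(10**4, 100)

with h5py.File("big_array.h5", "r") as f:
    piece = f["a"][5000:6000, :]   # only the chunks covering this slice are read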
You can use dask. dask arrays allow you to create an object that behaves like a single big numpy array but represents the data stored in many small HDF5 files. dask will take care of figuring out how any operations you carry out relate to the underlying on-disk data for you.
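Roughly like this, assuming each file contains one dataset called "data" with the same number of columns (all names here are placeholders):

import glob
import h5py
import dask.array as da

# wrap each on-disk dataset lazily, then stitch them into one logical array
files = [h5py.File(name, "r") for name in sorted(glob.glob("part_*.h5"))]
pieces = [da.from_array(f["data"], chunks=(10**4, 100)) for f in files]
a = da.concatenate(pieces, axis=0)

chunk = a[10**5:10**5 + 50].compute()   # slices like numpy, reads only what it needs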

Store Numpy as pickled Pandas, Pickled Numpy or HDF5

I am right now working with 300 float features coming from a preprocessing of item information. Such items are identified by a UUID (i.e. a string). The current file size is around 200MB. So far I have stored them as pickled numpy arrays. Sometimes I need to map the UUID of an item to a numpy row. For that I am using a dictionary (stored as JSON) that maps UUIDs to rows in the numpy array.
I was tempted to use Pandas and replace that dictionary with a Pandas index. I also discovered the HDF5 file format, but I would like to know a bit more about when to use each of them.
I use part of the array to feed a scikit-learn based algorithm and then perform classification on the rest.
Storing pickled numpy arrays is indeed not an optimal approach. Instead, you can (a short sketch follows below):
use numpy.savez to save a dictionary of numpy arrays in a binary format
store a pandas DataFrame in HDF5
directly use PyTables to write your numpy arrays to HDF5.
HDF5 is a preferred format for storing scientific data and offers, among other things,
parallel read/write capabilities
on the fly compression algorithms
efficient querying
ability to work with large datasets that don't fit in RAM.
That said, the choice of output file format for a small dataset of 200MB is not that critical and is more a matter of convenience.
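For instance, the first two options could look roughly like this (file names and the example uuids/features are placeholders for your data):

import numpy as np
import pandas as pd

uuids = ["a1b2", "c3d4"]                      # your UUID strings
features = np.random.rand(len(uuids), 300)    # your 300 float features per item

# option 1: numpy.savez keeps several arrays together in one binary .npz file
np.savez("features.npz", features=features, uuids=np.array(uuids))

# option 2: a DataFrame indexed by UUID, stored in HDF5 (requires PyTables)
df = pd.DataFrame(features, index=uuids)
df.to_hdf("features.h5", key="features", mode="w")
row = pd.read_hdf("features.h5", "features").loc["a1b2"]   # look up a row by UUID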

Working with very large arrays - Numpy

My situation is like this:
I have around 70 million integer values distributed in various files for ~10 categories of data (exact number not known)
I read those several files and create some Python object with that data. This would obviously include reading each file line by line and appending to the Python object. So I'll have an array with 70 million subarrays, with 10 values in each.
I do some statistical processing on that data. This would involve appending several values (say, percentile rank) to each 'row' of data.
I store this object in a database.
Now I have never worked with data of this scale. My first instinct was to use Numpy for arrays that are more memory-efficient. But then I've heard that appending to Numpy arrays is discouraged as it's not as efficient.
So what would you suggest I go with? Any general tips for working with data of this size? I can bring the data down to 20% of its size with random sampling if it's required.
EDIT: Edited for clarity about size and type of data.
If I understand your description correctly, your dataset will contain ~700 million integers. Even if you use 64-bit ints that would still only come to about 6GB. Depending on how much RAM you have and what you want to do in terms of statistical processing, your dataset sounds like it would be quite manageable as a normal numpy array living in core memory.
If the dataset is too large to fit in memory, a simple solution might be to use a memory-mapped array (numpy.memmap). In most respects, an np.memmap array behaves like a normal numpy array, but instead of storing the whole dataset in system memory, it will be dynamically read from/written to a file on disk as required.
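A minimal sketch of the memory-mapped route (file name, shape, and dtype are only examples):

import numpy as np

# disk-backed array; only the parts you touch are pulled into RAM
a = np.memmap("data.dat", dtype=np.int64, mode="w+", shape=(70_000_000, 10))
a[:1000, :] = np.arange(10)      # writes go to the file
a.flush()

b = np.memmap("data.dat", dtype=np.int64, mode="r", shape=(70_000_000, 10))
p90 = np.percentile(b[:1_000_000], 90, axis=0)   # process one slice at a time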
Another option would be to store your data in an HDF5 file, for example using PyTables or H5py. HDF5 allows the data to be compressed on disk, and PyTables includes some very fast methods to perform mathematical operations on large disk-based arrays.
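And a rough sketch of the PyTables route with on-disk compression (names and sizes again are illustrative):

import numpy as np
import tables

with tables.open_file("data.h5", mode="w") as f:
    filters = tables.Filters(complevel=5, complib="blosc")   # transparent compression
    arr = f.create_carray(f.root, "values", tables.Int64Atom(),
                          shape=(70_000_000, 10), filters=filters)
    arr[:1000, :] = np.random.randint(0, 100, size=(1000, 10))

with tables.open_file("data.h5", mode="r") as f:
    chunk = f.root.values[:1_000_000]   # read only the rows you need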

How to save big array so that it will take less memory in python?

I am new to Python. I have a big array, a, with dimensions such as (43200, 4000), and I need to save this, as I need it for future processing. When I try to save it with np.savetxt, the txt file is too large and my program runs into a memory error, as I need to process 5 files of the same size. Is there any way to save huge arrays so that they take up less memory?
Thanks.
Saving your data to a text file is hugely inefficient. Numpy has the built-in saving functions save and savez/savez_compressed, which are much better suited to storing large arrays.
Depending on how you plan to use your data, you should also look into the HDF5 format (h5py or pytables), which allows you to store large datasets without having to load them all into memory.
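For example (file names here are arbitrary):

import numpy as np

a = np.ones((43200, 4000))

np.save("a.npy", a)                 # fast binary dump of a single array
np.savez_compressed("a.npz", a=a)   # compressed archive, can hold several arrays

a_back = np.load("a.npy")
a_back_too = np.load("a.npz")["a"]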
You can use PyTables to create a Hierarchical Data Format (HDF) file to store the data. This provides some interesting in-memory options that link the object you're working with to the file it's saved in.
Here is another StackOverflow question that demonstrates how to do this: "How to store a NumPy multidimensional array in PyTables."
If you are willing to work with your array as a Pandas DataFrame object, you can also use the Pandas interface to PyTables / HDF5, e.g.:
import pandas
import numpy as np
a = np.ones((43200, 4000)) # Not recommended.
x = pandas.HDFStore("some_file.hdf")
x.append("a", pandas.DataFrame(a)) # <-- This will take a while.
x.close()
# Then later on...
my_data = pandas.HDFStore("some_file.hdf") # might also take a while
usable_a_copy = my_data["a"] # Be careful of the way changes to
# `usable_a_copy` affect the saved data.
copy_as_nparray = usable_a_copy.values
With files of this size, you might consider whether your application can be performed with a parallel algorithm and potentially applied to only subsets of the large arrays rather than needing to consume all of the array before proceeding.
