When is h5py useful for storing data? - python

I am using h5py to store data in Python:
import h5py
def store(eigenvalues, eigenvectors, name='01_'):
    datafile = h5py.File(name + 'data.h5', 'w')
    datafile['eigenvalues'] = eigenvalues
    datafile['eigenvectors'] = eigenvectors
    datafile.close()
    print("Successfully saved eigenvalues and eigenvectors")
It is really useful for storing these large arrays.
But when I tried to store, say, only two columns of data, I found that saving them to a normal data file was more space efficient.
Is there a critical data size above which h5py-format storage becomes more efficient?
Also, are there any other non-obvious advantages of using this format?

There are lots of advantages to using HDF5. As @EnricoGiampieri says, it's generally used for storing large ensembles of data rather than just single arrays. It is also useful for storing all the associated metadata at the same time.
From the HDF5 website
The HDF5 technology suite includes:
A versatile data model that can represent very complex data objects and a wide variety of metadata.
A completely portable file format with no limit on the number or size of data objects in the collection.
A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
A rich set of integrated performance features that allow for access time and storage space optimizations.
Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.
It's a hierarchical data format which is self-describing, meaning that the datasets in the file are easily discoverable. It scales to very large file sizes and to massively parallel I/O.
As regards compression, this is a property of an individual dataset and needs to be specified when you create that dataset. There are several different options for what compression algorithm to use - GZIP, SZIP and LZF are all supported. There is more information on the h5py wiki.
To apply compression to your file, try this:
import h5py
def store(eigenvalues, eigenvectors, name='01_'):
    datafile = h5py.File(name + 'data.h5', 'w')
    eigenvalues_dset = datafile.create_dataset('eigenvalues', eigenvalues.shape,
                                               eigenvalues.dtype,
                                               compression='gzip', compression_opts=4)
    eigenvectors_dset = datafile.create_dataset('eigenvectors', eigenvectors.shape,
                                                eigenvectors.dtype,
                                                compression='gzip', compression_opts=4)
    datafile['eigenvalues'][:] = eigenvalues
    datafile['eigenvectors'][:] = eigenvectors
    datafile.close()
    print("Successfully saved eigenvalues and eigenvectors")
Here I've assumed that eigenvalues and eigenvectors are both numpy arrays. You should convert them if they are not (just use numpy.array(eigenvalues)). Also note that to assign the datasets, I've used [:] - this is because datafile['eigenvalues'] is an HDF5 object, while datafile['eigenvalues'][:] is the actual data in that object. The HDF5 object holds not just the data, but also attributes and metadata.
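For example, here is a minimal sketch of attaching metadata to the dataset created above (the attribute names and values are made up for illustration):
import h5py
with h5py.File('01_data.h5', 'a') as datafile:      # reopen the file written by store()
    dset = datafile['eigenvalues']
    dset.attrs['units'] = 'eV'                       # hypothetical metadata
    dset.attrs['description'] = 'eigenvalues for the 01_ run'
The attributes travel with the dataset in the same file, so the data stays self-describing.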

Related

Dask Read Data from Binary File

I am trying to implement an out-of-core version of the k-means clustering algorithm in Python. I learned about Dask from this git project K-Mean Parallel... Dask...
I use the same git project, but am trying to load my data, which is in the form of a binary file. The binary file contains data points with 1024 floating-point features each.
My problem is: how do I load the data if it is very large, i.e. larger than the available memory itself? I tried to use numpy's fromfile function, but my kernel somehow dies. Some of my questions are:
Q. Is it possible to load data into numpy that was created by some other source (the file was not created by numpy but by a C script)?
Q. Is there a module for dask that can load data directly from a binary file? I have seen CSV files used, but nothing related to binary files.
I only dabble in Dask, but calling np.fromfile in the code below via Dask delayed should allow you to work with it lazily. That said, I'm working on this myself so this is currently a partial answer.
For your first question: I currently load .bin files created by a LabVIEW program with no issues, using code similar to this:
import numpy as np
import os

myfile = "yourfile.bin"    # placeholder path to your .bin file
chunkSize = int(1e6)       # number of float32 values per chunk; adjust as needed
fileSize = os.path.getsize(myfile)
data = []
with open(myfile, "rb") as file:                        # binary read mode
    for chunk in range(0, fileSize, chunkSize * 4):     # 4 bytes per float32 value
        data.append(np.fromfile(file, dtype=np.float32, count=chunkSize))
For the second question: I have not been able to find anything in Dask for dealing with binary files directly. I find that a conversion to something that Dask can use is worth it.
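As a rough sketch of the "Dask delayed" idea mentioned above (the file name, chunk size and row count are placeholders; I'm assuming float32 points with 1024 features each, as in the question):
import os
import dask
import dask.array as da
import numpy as np

path = "data.bin"               # placeholder path to the binary file
n_features = 1024               # features per data point, from the question
rows_per_chunk = 10_000         # tune to your memory budget
bytes_per_row = n_features * 4  # float32 = 4 bytes

@dask.delayed
def read_chunk(offset, n_rows):
    with open(path, "rb") as f:
        f.seek(offset)
        a = np.fromfile(f, dtype=np.float32, count=n_rows * n_features)
        return a.reshape(-1, n_features)

total_rows = os.path.getsize(path) // bytes_per_row
pieces = []
for start in range(0, total_rows, rows_per_chunk):
    n = min(rows_per_chunk, total_rows - start)
    pieces.append(da.from_delayed(read_chunk(start * bytes_per_row, n),
                                  shape=(n, n_features), dtype=np.float32))
data = da.concatenate(pieces, axis=0)   # lazy; nothing is read until you compute()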

how can I save super large array into many small files?

In a 64-bit Linux environment, I have a very big float64 array (a single one will be 500GB to 1TB). I would like to access these arrays in numpy in a uniform way: a[x:y]. So I do not want to access the array as segments, file by file. Are there any tools with which I can create a memmap over many different files? Can hdf5 or pytables store a single CArray across many small files? Maybe something similar to fileInput? Or can I do something with the file system to simulate a single file?
In MATLAB I've been using H5P.set_external to do this. Then I can create a raw dataset and access it as one big raw file. But I do not know if I can create a numpy.ndarray over such a dataset in Python. Or can I spread a single dataset over many small hdf5 files?
Unfortunately, H5P.set_chunk does not work with H5P.set_external, because set_external only works with a contiguous data layout, not a chunked layout.
Some related topics:
Chain datasets from multiple HDF5 files/datasets
I would use hdf5. In h5py, you can specify a chunk size which makes retrieving small pieces of the array efficient:
http://docs.h5py.org/en/latest/high/dataset.html?#chunked-storage
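A minimal sketch of what that looks like in h5py (the dataset name, sizes and chunk shape here are arbitrary):
import h5py
import numpy as np

with h5py.File("big.h5", "w") as f:
    dset = f.create_dataset("a", shape=(10_000_000,), dtype="float64",
                            chunks=(1_000_000,))        # 1M-element chunks
    dset[:1_000_000] = np.random.rand(1_000_000)        # write one chunk's worth

with h5py.File("big.h5", "r") as f:
    piece = f["a"][500_000:500_010]    # only the chunk(s) covering this slice are read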
You can use dask. Dask arrays allow you to create an object that behaves like a single big numpy array but represents data stored in many small HDF5 files. Dask will take care of figuring out how any operations you carry out relate to the underlying on-disk data.
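For example, assuming each small file holds a piece of the array in a 1-D dataset called "a" (the file names and chunk sizes are placeholders):
import dask.array as da
import h5py

files = [h5py.File(p, "r") for p in ["part_000.h5", "part_001.h5", "part_002.h5"]]
pieces = [da.from_array(f["a"], chunks=(1_000_000,)) for f in files]

x = da.concatenate(pieces)                    # behaves like one big numpy-like array
segment = x[5_000_000:5_000_100].compute()    # only the needed blocks are read from disk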

How to save big array so that it will take less memory in python?

I am new to Python. I have a big array, a, with dimensions (43200, 4000), and I need to save it, as I need it for future processing. When I try to save it with np.savetxt, the txt file is too large and my program runs into a memory error, as I need to process 5 files of the same size. Is there any way to save huge arrays so that they take less memory?
Thanks.
Saving your data to a text file is hugely inefficient. Numpy has built-in saving commands, save and savez/savez_compressed, which are much better suited to storing large arrays.
Depending on how you plan to use your data, you should also look into the HDF5 format (h5py or pytables), which allows you to store large data sets without having to load them all into memory.
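A quick sketch of both suggestions (the array here is scaled down so the example actually runs; substitute your (43200, 4000) array):
import numpy as np

a = np.ones((4320, 400))              # stand-in for the real (43200, 4000) array

np.save("a.npy", a)                   # raw binary; reload with np.load("a.npy")
np.savez_compressed("a.npz", a=a)     # zlib-compressed archive; smaller on disk

b = np.load("a.npy", mmap_mode="r")   # memory-mapped: slices are read on demand
first_column = b[:, 0]                # only this column is pulled into memory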
You can use PyTables to create a Hierarchical Data Format (HDF) file to store the data. This provides some interesting in-memory options that link the object you're working with to the file it's saved in.
Here is another StackOverflow question that demonstrates how to do this: "How to store a NumPy multidimensional array in PyTables."
If you are willing to work with your array as a Pandas DataFrame object, you can also use the Pandas interface to PyTables / HDF5, e.g.:
import pandas
import numpy as np
a = np.ones((43200, 4000)) # Not recommended.
x = pandas.HDFStore("some_file.hdf")
x.append("a", pandas.DataFrame(a)) # <-- This will take a while.
x.close()
# Then later on...
my_data = pandas.HDFStore("some_file.hdf") # might also take a while
usable_a_copy = my_data["a"] # Be careful of the way changes to
# `usable_a_copy` affect the saved data.
copy_as_nparray = usable_a_copy.values
With files of this size, you might consider whether your application can be performed with a parallel algorithm and potentially applied to only subsets of the large arrays rather than needing to consume all of the array before proceeding.

Dictionary-like efficient storing of scipy/numpy arrays

BACKGROUND
The issue I'm working with is as follows:
Within the context of an experiment I am designing for my research, I produce a large number of large (length-4M) arrays which are somewhat sparse, and so could be stored as scipy.sparse.lil_matrix instances or simply as scipy.array instances (the space gain/loss isn't the issue here).
Each of these arrays must be paired with a string (namely a word) for the data to make sense, as they are semantic vectors representing the meaning of that string. I need to preserve this pairing.
The vectors for each word in a list are built one-by-one, and stored to disk before moving on to the next word.
They must be stored to disk in a manner which could be then retrieved with dictionary-like syntax. For example if all the words are stored in a DB-like file, I need to be able to open this file and do things like vector = wordDB[word].
CURRENT APPROACH
What I'm currently doing:
Using shelve to open a shelf named wordDB
Each time the vector (currently using lil_matrix from scipy.sparse) for a word is built, storing the vector in the shelf: wordDB[word] = vector
When I need to use the vectors during the evaluation, I'll do the reverse: open the shelf, and then recall vectors by doing vector = wordDB[word] for each word, as they are needed, so that not all the vectors need be held in RAM (which would be impossible).
The above 'solution' fits my needs in terms of solving the problem as specified. The issue is simply that when I wish to use this method to build and store vectors for a large number of words, I run out of disk space.
This is, as far as I can tell, because shelve pickles the data being stored, which is not an efficient way of storing large arrays, thus rendering this storage problem intractable with shelve for the number of words I need to deal with.
PROBLEM
The question is thus: is there a way of serializing my set of arrays which will:
Save the arrays themselves in compressed binary format akin to the .npy files generated by scipy.save?
Meet my requirement that the data be readable from disk as a dictionary, maintaining the association between words and arrays?
As JoshAdel already suggested, I would go for HDF5; the simplest way is to use h5py:
http://h5py.alfven.org/
You can attach several attributes to an array with a dictionary-like syntax:
dset.attrs["Name"] = "My Dataset"
where dset is your dataset, which can be sliced exactly like a numpy array, but in the background it does not load the whole array into memory.
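A small sketch of the word-to-vector case under that approach (the file name, word and vector contents are made up for illustration):
import numpy as np
import h5py

with h5py.File("wordDB.h5", "w") as f:
    vec = np.zeros(4_000_000, dtype=np.float32)              # stand-in semantic vector
    dset = f.create_dataset("apple", data=vec, compression="gzip")
    dset.attrs["Name"] = "vector for 'apple'"                # dictionary-like attribute syntax

with h5py.File("wordDB.h5", "r") as f:
    dset = f["apple"]        # no data loaded yet
    part = dset[:1000]       # only this slice is read into memory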
I would suggest using scipy.save and keeping a dictionary between each word and the name of its file.
Have you tried just using cPickle to pickle the dictionary directly using:
import cPickle
DD = dict()
f = open('testfile.pkl','wb')
cPickle.dump(DD,f,-1)
f.close()
Alternatively, I would just save the vectors in a large multidimensional array using hdf5 or netcdf, since this allows you to open a large array without bringing it all into memory at once and then get slices as needed. You can then store the words as an additional group in the netcdf4/hdf5 file and use the common indices to quickly associate the appropriate slice from each group, or just name the group after the word and have the data be the vector. You'd have to play around to see which is more efficient.
http://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4-module.html
Pytables also might be a useful storage layer on top of HDF5:
http://www.pytables.org
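A rough sketch of the common-index variant described above (the word list and vector length are placeholders; with chunks of one row, a lookup reads only that row from disk):
import numpy as np
import h5py

words = ["apple", "banana"]                            # hypothetical word list
vec_len = 4_000_000
vectors = np.zeros((len(words), vec_len), dtype=np.float32)

with h5py.File("vectors.h5", "w") as f:
    f.create_dataset("words", data=np.array(words, dtype="S"))   # fixed-length byte strings
    f.create_dataset("vectors", data=vectors, chunks=(1, vec_len))

with h5py.File("vectors.h5", "r") as f:
    index = {w.decode(): i for i, w in enumerate(f["words"][:])}
    v = f["vectors"][index["apple"], :]                # reads only that row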
Avoid using shelve; it's bug-ridden and has cross-platform issues.
The memory issue, however, has nothing to do with shelve. Numpy arrays provide an efficient implementation of the pickle protocol, and there is little memory overhead to cPickle.dumps(protocol=-1) compared to binary .npy (only the extra headers in pickle, basically).
So if binary/pickle isn't enough, you'll have to go for compression. Have a look at pytables or h5py (difference between the two).
If specifying the binary protocol in pickle is enough, you can consider something more lightweight than hdf5: check out sqlitedict for a replacement of shelve. It has no additional dependencies.
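A minimal sqlitedict sketch of the shelve-style usage from the question (the file name and word are placeholders; values are pickled under the hood):
import numpy as np
from sqlitedict import SqliteDict

wordDB = SqliteDict("wordDB.sqlite", autocommit=True)
wordDB["apple"] = np.zeros(4_000_000, dtype=np.float32)   # store a vector per word
vector = wordDB["apple"]                                  # dictionary-style retrieval
wordDB.close()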

HDF5: storing NumPy data

When I used NumPy I stored its data in the native *.npy format. It's very fast and gave me some benefits, like this one:
I could read *.npy from C code as simple binary data (I mean *.npy files are binary-compatible with C structures).
Now I'm dealing with HDF5 (PyTables at this moment). As I read in the tutorial, it uses the NumPy serializer to store NumPy data, so can I read these data from C just as from simple *.npy files?
Is HDF5's NumPy data binary-compatible with C structures too?
UPD:
I have a MATLAB client reading from hdf5, but I don't want to read hdf5 from C++ through the library, because reading binary data from *.npy is much faster, so I really need a way to read hdf5 from C++ with binary compatibility.
So I'm currently using two ways of transferring data: *.npy (read from C++ as bytes, and from Python natively) and hdf5 (accessed from MATLAB).
If possible, I want to use only one way - hdf5 - but to do that I have to find a way to make hdf5 binary-compatible with C++ structures. Please help: if there is some way to turn off compression in hdf5, or something else that makes hdf5 binary-compatible with C++ structures, tell me where I can read about it.
The proper way to read hdf5 files from C is to use the hdf5 API - see this tutorial. In principle it is possible to directly read the raw data from the hdf5 file as you would with the .npy file, assuming you have not used advanced storage options such as compression in your hdf5 file. However, this essentially defeats the whole point of using the hdf5 format, and I cannot think of any advantage to doing this instead of using the proper hdf5 API. Also note that the API has a simplified high-level version which should make reading from C relatively painless.
I feel your pain. I've been dealing extensively with massive amounts of data stored in HDF5 formatted files, and I've gleaned a few bits of information you may find useful.
If you are in "control" of the file creation (and writing the data - even if you use an API) you should be able to largely entirely circumvent the HDF5 libraries.
If you the output datasets are not chunked, they will be written contiguously. As long as you aren't specifying any byte-order conversion in your datatype definitions (i.e. you are specifying the data should be written in native float/double/integer format) you should be able to achieve "binary-compatibility" as you put it.
To solve my problem I wrote an HDF5 file parser using the file specification http://www.hdfgroup.org/HDF5/doc/H5.format.html
With a fairly simple parser you should be able to identify the offset to (and size of) any dataset. At that point simply fseek and fread (in C, that is, perhaps there is a higher level approach you can take in C++).
If your datasets are chunked, then more parsing is necessary to traverse the b-trees used to organize the chunks.
The only other issue you should be aware of is handling (or eliminating) any system-dependent structure padding.
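If hand-parsing the format sounds like too much, a lighter-weight compromise (a different technique from the parser described above) is to let h5py report the offset of a contiguous, uncompressed dataset once, and then read the raw bytes with plain file I/O - fseek/fread in C, or np.fromfile here. A minimal sketch, with placeholder file and dataset names:
import numpy as np
import h5py

with h5py.File("data.h5", "r") as f:
    dset = f["my_dataset"]                # must be contiguous and uncompressed
    offset = dset.id.get_offset()         # byte offset of the raw data in the file
    shape, dtype = dset.shape, dset.dtype

with open("data.h5", "rb") as raw:
    raw.seek(offset)
    a = np.fromfile(raw, dtype=dtype, count=int(np.prod(shape))).reshape(shape)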
HDF5 takes care of binary compatibility of structures for you. You simply have to tell it what your structs consist of (dtype) and you'll have no problems saving/reading record arrays - this is because the type system is basically 1:1 between numpy and HDF5. If you use h5py, I'm confident the I/O should be fast enough, provided you use all native types and large batched reads/writes (the entire dataset, if allowable). After that it depends on chunking and what filters you use (shuffle and compression, for example) - it's also worth noting that those can sometimes speed things up by greatly reducing file size, so always look at benchmarks. Note that the type and filter choices are made on the end creating the HDF5 document.
If you're trying to parse HDF5 yourself, you're doing it wrong. Use the C++ and C APIs if you're working in C++/C. There are examples of so-called "compound types" on the HDF Group's website.
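For completeness, a short sketch from the Python side of the compound-type mapping mentioned above (the field names and sizes are arbitrary); the resulting dataset is laid out like the matching C struct, modulo padding:
import numpy as np
import h5py

dt = np.dtype([("id", np.int32), ("value", np.float64)])   # maps to an HDF5 compound type
records = np.zeros(10, dtype=dt)
records["id"] = np.arange(10)
records["value"] = np.linspace(0.0, 1.0, 10)

with h5py.File("records.h5", "w") as f:
    f.create_dataset("records", data=records)   # no compression, so readable as raw bytes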
