I have several (up to a few hundred) HDF5 files, which contain the results of a parallel simulation: each compute node creates an independent HDF5 file to avoid any synchronization problems.
Is there any way to create an 'image' of all the data in the HDF5 files, such that this 'image' would look as if it held all the data, while in reality it serves the data from the other files?
Here is what I'm looking for:
"data-node0.h5"
spike/PopulationA -> pandas data frame columns=[0,3,6,9]
"data-node1.h5"
spike/PopulationA -> pandas data frame columns=[1,4,7,10]
"data-node2.h5"
spike/PopulationA -> pandas data frame columns=[2,5,8,11]
spike/PopulationB -> pandas data frame columns=[0,1,2,3]
"data.h5" = aggregate("data-node0.h5","data-node1.h5","data-node2.h5")
"data.h5"
spike/PopulationA -> pandas data frame columns=[0,1,2,3,4,5,6,7,8,9,10,11]
spike/PopulationB -> pandas data frame columns=[0,1,2,3]
Note that the file data.h5 doesn't contain any data itself; it uses the data from the data-nodeX.h5 files.
Update: The data in the HDF5 files are pandas data frames with time series. Each column in a data frame is a 1D numpy array recorded from an object in the model, and the column identifier is the unique ID of that object. The table index is the model time in ms.
In version 1.10+, HDF5 added a virtual dataset feature that allows you to map data from multiple datasets into a top-level 'virtual' dataset, which stores no data itself.
The documentation is here:
https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesVirtualDatasetDocs.html
The complication, of course, is that it looks like you are using PyTables via Pandas and not raw HDF5. PyTables is HDF5, but adds a layer of structure and semantics on top of HDF5's groups and datasets. In order to create a virtual dataset based on PyTables, you are going to have to dig around in the sub-structure of the PyTables HDF5 objects to set up the mapping. Also, any virtual dataset you create will be a regular HDF5 dataset and not a PyTables table. This is certainly doable given a basic knowledge of HDF5, though possibly more work than you hoped.
h5py (a much lower-level and more direct Python wrapper for HDF5) has support for the virtual dataset feature, btw, so you can still do everything in Python, just not via PyTables.
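For illustration, here is a minimal sketch of the h5py virtual-dataset API applied to the layout from the question. It assumes each node file exposes a plain 2D float64 array at an internal path like "spike/PopulationA/block0_values" (the actual path that pandas/PyTables creates may differ, so inspect your files first, e.g. with h5ls); the row count and the column interleaving are assumptions taken from the example above:
import h5py
import numpy as np

files = ["data-node0.h5", "data-node1.h5", "data-node2.h5"]
n_rows = 1000                 # assumed number of time steps, identical in every file
cols_per_file = 4             # node i wrote columns i, i+3, i+6, i+9

# The virtual dataset has the combined shape; it stores no data itself.
layout = h5py.VirtualLayout(shape=(n_rows, cols_per_file * len(files)), dtype="float64")
for i, fname in enumerate(files):
    src = h5py.VirtualSource(fname, "spike/PopulationA/block0_values",
                             shape=(n_rows, cols_per_file))
    layout[:, i::len(files)] = src          # interleave node i's columns back

with h5py.File("data.h5", "w", libver="latest") as f:
    grp = f.create_group("spike")
    grp.create_virtual_dataset("PopulationA", layout, fillvalue=np.nan)
Reading data.h5 then behaves as if spike/PopulationA held all twelve columns, while the bytes stay in the node files.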
Related
I'm building a deep learning model for speech emotion recognition in google colab environment.
The process of the data and features extraction from the audio files is taking about 20+ mins of runtime.
Therefore, I have made a pandas DataFrame containing all of the data which I want to export to a CSV file so I wouldn't need to wait that long for the data to be extracted every time.
Because the audio files have an average sample rate of 44,100 frames per second (Hz), I get a huge array of values for each file.
df.sample shows, for example, only a truncated representation of the 'x' column: each 'x' array has about 170K values, but just this abbreviated form appears in the output.
Unfortunately, df.to_csv writes that same truncated representation, and NOT the full arrays.
Is there a way to export the full DataFrame as CSV? (Should be miles and miles of data for each row...)
The problem is that a DataFrame is not really meant to contain np.arrays as values. Since numpy is the underlying framework for pandas, np.arrays get special treatment. In any case, a DataFrame is intended to be a data-processing tool, not a general-purpose container, so I think you are using the wrong tool here.
If you still want to go that way, it is enough to change the np.arrays into lists:
df['x'] = df['x'].apply(list)
But at load time, you will have to declare a converter to change the string representations of lists into plain lists:
df = pd.read_csv('data.csv', converters={'x': ast.literal_eval, ...})
But again, a CSV file is not intended to have fields containing large lists, and performance may not be what you expect.
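Putting the two snippets together, a complete round trip might look like this (a hedged sketch; the column names, the placeholder data, and the file name are not from the question):
import ast
import numpy as np
import pandas as pd

# Placeholder data: one large numpy array per row in column 'x'.
df = pd.DataFrame({"x": [np.arange(5.0), np.arange(5.0, 10.0)], "label": ["a", "b"]})

out = df.copy()
out["x"] = out["x"].apply(list)        # arrays -> lists, so the full values are written
out.to_csv("data.csv", index=False)

# On load, parse the stringified lists back and turn them into arrays again.
back = pd.read_csv("data.csv", converters={"x": ast.literal_eval})
back["x"] = back["x"].apply(np.array)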
I have a large python dictionary of values (around 50 GB), and I've stored it as a JSON file. I am having efficiency issues when it comes to opening the file and writing to the file. I know you can use ijson to read the file efficiently, but how can I write to it efficiently?
Should I even be using a Python dictionary to store my data? Is there a limit to how large a python dictionary can be? (the dictionary will get larger).
The data basically stores the path length between nodes in a large graph. I can't store the data as a graph because searching for a connection between two nodes takes too long.
Any help would be much appreciated. Thank you!
Although it will truly depend on what operations you want to perform on your network dataset, you might want to consider storing it as a pandas DataFrame and then writing it to disk using Parquet or Arrow.
That data could then be loaded into networkx or even into Spark (GraphX) for any network-related operations.
Parquet is compressed and columnar, and it makes reading and writing files much faster, especially for large datasets.
From the Pandas Doc:
Apache Parquet provides a partitioned binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.
Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including extension dtypes such as datetime with tz.
Read further here: Pandas Parquet
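As an illustration of that approach, here is a hedged sketch; the column names, file name, and toy data are placeholders rather than anything from the question:
import pandas as pd

# Store pairwise path lengths as an edge-list style DataFrame.
path_lengths = {("a", "b"): 3, ("a", "c"): 5, ("b", "c"): 2}   # toy data

df = pd.DataFrame(
    [(src, dst, length) for (src, dst), length in path_lengths.items()],
    columns=["source", "target", "path_length"],
)
df.to_parquet("path_lengths.parquet")    # needs pyarrow or fastparquet installed

# Later, load only the columns you need instead of parsing a 50 GB JSON file.
df = pd.read_parquet("path_lengths.parquet", columns=["source", "path_length"])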
Try reading it with pandas: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
Convert a JSON string to pandas object
It is a very lightweight and useful library for working with large data.
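For a file of this size, the lines and chunksize parameters listed above are the useful ones: if the data can be rewritten as JSON Lines (one record per line), it can be processed in pieces instead of all at once. A hedged sketch, where the file name and the process function are placeholders:
import pandas as pd

def process(chunk):
    # Placeholder for whatever you do with each piece of the data.
    print(len(chunk))

# Each iteration yields a regular DataFrame of 100,000 records.
reader = pd.read_json("data.jsonl", lines=True, chunksize=100_000)
for chunk in reader:
    process(chunk)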
I have several .h5 files which contain pandas DataFrames created with the .to_hdf method. My question is quite simple: is it possible to retrieve the dimensions of the DataFrame stored in a .h5 file without loading all the data into RAM?
Motivation : the DataFrames stored in those HDF5 files are quite big (up to several Gb) and loading all the data just to get the shape of the data is really time consuming.
You are probably going to want to use PyTables directly.
The API reference is here, but basically:
from tables import open_file

h5file = open_file("yourfile.h5", mode="r")
print(h5file.root.<yourdataframe>.table.shape)
print(len(h5file.root.<yourdataframe>.table.cols) - 1)  # first col is an index
Also, just for clarity, HDF5 does not read all the data when a dataset is opened. That would be a peculiarity of Pandas.
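A hedged alternative at the pandas level, for DataFrames that were saved in "table" format (the file name and key below are placeholders): HDFStore can report the row count through its storer without reading the data.
import pandas as pd

with pd.HDFStore("yourfile.h5", mode="r") as store:
    storer = store.get_storer("yourdataframe")
    print(storer.nrows)        # number of rows, without loading the frame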
I am managing larger-than-memory CSV files of mostly categorical data. Initially I would create a large CSV file, then read it via pandas read_csv, convert it to categorical, and save it into HDF5. Once in categorical format, it fits nicely in memory.
Files are growing and I moved to Dask. Same process though.
However, for empty fields, pandas seems to use np.nan, and that value is not included in the cat.categories listing.
With Dask, empty values are filled with NaN, NaN is included as a separate category, and when the data is saved into HDF I get a future-compatibility warning.
Is this a bug, or am I missing a step? The behaviour seems to differ between pandas and dask.
Thanks
JC
This is solved in dask ver 0.11.1
See https://github.com/dask/dask/pull/1578
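With that fix in place, the workflow described in the question might look roughly like this (a hedged sketch; the file pattern and the column name "label" are placeholders, and categoricals need the "table" HDF format):
import dask.dataframe as dd

ddf = dd.read_csv("data-*.csv")
ddf = ddf.categorize(columns=["label"])      # scans the data to learn the categories
ddf.to_hdf("data.h5", "/table", format="table")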
I am currently working with 300 float features coming from preprocessing of item information. The items are identified by a UUID (i.e. a string). The current file size is around 200 MB. So far I have stored them as pickled numpy arrays. Sometimes I need to map an item's UUID to a numpy row. For that I am using a dictionary (stored as JSON) that maps each UUID to a row in the numpy array.
I was tempted to use pandas and replace that dictionary with a pandas index. I also discovered the HDF5 file format, but I would like to know a bit more about when to use each of them.
I use part of the array to feed a scikit-learn based algorithm and then perform classification on the rest.
Storing pickled numpy arrays is indeed not an optimal approach. Instead, you can (as sketched at the end of this answer):
use numpy.savez to save a dictionary of numpy arrays in a binary format
store a pandas DataFrame in HDF5
directly use PyTables to write your numpy arrays to HDF5.
HDF5 is a preferred format for storing scientific data and offers, among other things,
parallel read/write capabilities
on-the-fly compression algorithms
efficient querying
the ability to work with large datasets that don't fit in RAM.
That said, the choice of output file format for a small 200 MB dataset is not that critical and is more a matter of convenience.
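A minimal sketch of the first two options above (the shapes, file names, and UUIDs are placeholders):
import numpy as np
import pandas as pd

# Placeholder data: 300 float features per item, one UUID per row.
features = np.random.rand(1000, 300)
uuids = np.array([f"uuid-{i}" for i in range(1000)])

# Option 1: numpy.savez keeps both arrays together in a binary .npz archive.
np.savez("features.npz", features=features, uuids=uuids)

# Option 2: a DataFrame indexed by UUID, stored in HDF5 (requires PyTables).
df = pd.DataFrame(features, index=pd.Index(uuids, name="uuid"))
df.to_hdf("features.h5", key="features", mode="w")
row = df.loc["uuid-42"]       # UUID -> row lookup replaces the JSON dictionary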