Xarray to merge two hdf5 file with different dimension length - python

I have some instrument data saved in HDF5 format as multiple 2-D arrays along with the measurement time. As shown in the attached figures below, d1 and d2 are two independent files that the instrument recorded at different times. They have the same data variables, and the only difference is the length of phony_dim_0, which represents the total number of data points and varies with measurement time.
These files need to be loaded into specific software provided by the instrument company to obtain meaningful results. I want to merge multiple files with Python xarray while keeping their original format, and then load one merged file into the software.
Here is my attempt:
files = os.listdir("DATA_PATH")
d1 = xarray.open_dataset(files[0])
d2 = xarray.open_dataset(files[1])
## copy a new one to save the merged data array.
d0 = d1
vars_ = [c for c in d1]
for var in vars_:
    d0[var].values = np.vstack([d1[var],d2[var]])
The error shows like this:
replacement data must match the Variable's shape. replacement data has shape (761, 200); Variable has shape (441, 200)
I thought about two solutions for this problem:
1. Expanding the dimension length to the total length of all merged files.
2. Creating a new empty dataset in the same format as d1 and d2.
However, I still could not figure out the function to achieve that. Any comments or suggestions would be appreciated.
Supplemental information
dataset example [d1],[d2]

I'm not familiar with xarray, so can't help with your code. However, you don't need xarray to copy HDF5 data; h5py is designed to work nicely with HDF5 data as NumPy arrays, and is all you need to merge the data.
A note about Xarray. It uses different nomenclature than HDF5 and h5py. Xarray refers to the files as 'datasets', and calls the HDF5 datasets 'data variables'. HDF5/h5py nomenclature is more frequently used, so I am going to use it for the rest of my post.
There are some things to consider when merging datasets across 2 or more HDF5 files. They are:
Consistency of the data schema (which you have checked).
Consistency of attributes. If datasets have different attribute names or values, the merge process gets a lot more complicated! (Yours appear to be consistent.)
It's preferable to create resizable datasets in the merged file. This simplifies the process, as you don't need to know the total size when you initially create the dataset. Better yet, you can add more data later (if/when you have more files).
I looked at your files. You have 8 HDF5 datasets in each file. One nice thing: the datasets are resizable. That simplifies the merge process. Also, although your datasets have a lot of attributes, they appear to be common to both files. That also simplifies the process.
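To make the idea of a resizable dataset concrete, here is a minimal sketch. It is not taken from the files in the question: the dataset name "demo", the file name, and the array contents are made up for illustration, and the row counts (441 and 320) are only chosen to match the shapes in the error message. It uses an in-memory file so nothing is written to disk.

```python
import numpy as np
import h5py

# Hypothetical stand-ins for the data in two files (441 and 320 rows).
first = np.zeros((441, 200))
second = np.ones((320, 200))

# driver="core" with backing_store=False keeps the file in memory only.
with h5py.File("resizable_demo.h5", "w", driver="core", backing_store=False) as h5f:
    # maxshape=(None, 200) makes axis 0 unlimited, so the dataset can grow later.
    dset = h5f.create_dataset("demo", data=first, maxshape=(None, 200))

    old_rows = dset.shape[0]
    # Grow axis 0 to hold the new rows, then write them at the end.
    dset.resize(old_rows + second.shape[0], axis=0)
    dset[old_rows:] = second

    merged_shape = dset.shape
    merged_maxshape = dset.maxshape
    last_value = float(dset[-1, 0])

print(merged_shape, merged_maxshape)
```

If a dataset was created without maxshape, calling .resize() on it fails, which is why checking maxshape before merging is worthwhile.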
The code below goes through the following steps to merge the data:
1. Open the new merge file for writing.
2. Open the first data file (read-only).
3. Loop through all datasets:
   a. Use the group copy function to copy the dataset (data plus maxshape parameters, and attribute names and values).
4. Open the second data file (read-only).
5. Loop through all datasets and do the following:
   a. Get the size of the 2 datasets (existing and to be added).
   b. Increase the size of the HDF5 dataset with the .resize() method.
   c. Write values from the dataset to the end of the existing dataset.
6. At the end, loop through all 3 files and print shape and maxshape for all datasets (for visual comparison).
Code below:
import h5py
files = [ '211008_778183_m.h5', '211008_778624_m.h5', 'merged_.h5' ]
# Create the merge file:
with h5py.File('merged_.h5','w') as h5fw:
    # Open first HDF5 file and copy each dataset.
    # Will use maxshape and attributes from existing dataset.
    with h5py.File(files[0],'r') as h5fr:
        for ds in h5fr.keys():
            h5fw.copy(h5fr[ds], h5fw, name=ds)
    # Open second HDF5 file and copy data from each dataset.
    # Resizes existing dataset as needed to hold new data.
    with h5py.File(files[1],'r') as h5fr:
        for ds in h5fr.keys():
            ds_a0 = h5fw[ds].shape[0]
            add_a0 = h5fr[ds].shape[0]
            h5fw[ds].resize(ds_a0+add_a0,axis=0)
            h5fw[ds][ds_a0:] = h5fr[ds][:]
# Print shape and maxshape of every dataset in all 3 files for comparison.
for fname in files:
    print(f'Working on file:{fname}')
    with h5py.File(fname,'r') as h5f:
        for ds, h5obj in h5f.items():
            print (f'for: {ds}; shape={h5obj.shape}, maxshape={h5obj.maxshape}')

Related

MATLAB/Python: How can I load large files individually into an existing dataframe to train a classifier?

I am currently data wrangling on a very new project, and it is proving a challenge.
I have EEG data that has been preprocessed in eeglab in MATLAB, and I would like to load it into python to use it to train a classifier. I also have a .csv file with the subject IDs of each individual, along with a number (1, 2 or 3) corresponding to which third of the sample they are in.
Currently, I have the data saved as .mat files, one for each individual (104 in total), each containing an array shaped 64x2000x700 (64 channels, 2000 data points per 2 second segment (sampling frequency of 1000Hz), 700 segments). I would like to load each participant's data into the dataframe alongside their subject ID and classification score.
I tried this:
all_files = glob.glob(os.path.join(path, "*.mat"))
lang_class= pd.read_csv("TestLangLabels.csv")
df_dict = {}
for file in all_files:
    file_name = os.path.splitext(os.path.basename(file))[0]
    df_dict[file]
    df_dict[file_name]= loadmat(file,appendmat=False)
    # Setting the file name (without extension) as the index name
    df_dict[file_name].index.name = file_name
But the files are so large that this maxes out my memory and doesn't complete.
Then, I attempted to loop it using pandas using the following:
main_dataframe = pd.DataFrame(loadmat(all_files[0]))
for i in range(1,len(all_files)):
    data = loadmat(all_files[i])
    df = pd.DataFrame(data)
    main_dataframe = pd.concat([main_dataframe,df],axis=1)
At which point I got the error:
ValueError: Data must be 1-dimensional
Is there a way of doing this that I am overlooking, or will downsampling be inevitable?
subjectID | Data        | Class
AA123     | 64x2000x700 | 2
I believe that something like this could then be used as a test/train dataset for my model, but welcome any and all advice!
Thank you in advance.
Is there a reason you have such a high sampling rate? I don't believe I've heard a compelling reason to go over 512 Hz, and I normally take it down to 256 Hz. I don't know if it matters for ML, but most other approaches really don't need that. Going from 1000 Hz to 500 Hz or even 250 Hz might help.
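To see how much downsampling would help, here is a rough back-of-the-envelope sketch. It assumes each value is stored as an 8-byte float64, which is what loadmat typically returns for MATLAB doubles; if the data is single precision, halve the numbers. The function name total_bytes is made up for illustration.

```python
N_SUBJECTS = 104
N_CHANNELS = 64
SEGMENT_SECONDS = 2
N_SEGMENTS = 700

def total_bytes(sampling_hz, bytes_per_value=8):
    """Estimate the raw size of all subjects' arrays at a given sampling rate.

    bytes_per_value=8 assumes float64 storage (an assumption, not a known fact
    about these files).
    """
    points_per_segment = sampling_hz * SEGMENT_SECONDS
    values_per_subject = N_CHANNELS * points_per_segment * N_SEGMENTS
    return N_SUBJECTS * values_per_subject * bytes_per_value

for rate in (1000, 500, 250):
    print(f"{rate} Hz: {total_bytes(rate) / 1e9:.1f} GB")
```

Under these assumptions the full set is about 74.5 GB at 1000 Hz and about 18.6 GB at 250 Hz, so downsampling alone may still not be enough to hold everything in RAM at once.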

Reading Large HDF5 Files

I am new to using HDF5 files and I am trying to read files with shapes of (20670, 224, 224, 3). Whenever I try to store the results from the HDF5 file in a list or another data structure, it either takes so long that I abort the execution or it crashes my computer. I need to be able to read 3 sets of HDF5 files, use their data, manipulate it, use it to train a CNN model and make predictions.
Any help for reading and using these large HDF5 files would be greatly appreciated.
Currently this is how I am reading the hdf5 file:
db = h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5")
training_db = list(db['data'])
Crashes probably mean you are running out of memory. Like Vignesh Pillay suggested, I would try chunking the data and work on a small piece of it at a time. If you are using the pandas method read_hdf you can use the iterator and chunksize parameters to control the chunking:
import pandas as pd
data_iter = pd.read_hdf('/tmp/test.hdf', key='test_key', iterator=True, chunksize=100)
for chunk in data_iter:
    #train cnn on chunk here
    print(chunk.shape)
Note that this requires the HDF file to be in table format.
My answer updated 2020-08-03 to reflect code you added to your question.
As @Tober noted, you are running out of memory. Reading a dataset of shape (20670, 224, 224, 3) into a list produces about 3.1 billion values. If you read 3 image sets, it will require even more RAM.
I assume this is image data (maybe 20670 images of shape (224, 224, 3) )?
If so, you can read the data in slices with both h5py and tables (Pytables).
This will return the data as a NumPy array, which you can use directly (no need to manipulate into a different data structure).
Basic process would look like this:
import os
import h5py
with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5",'r') as db:
    training_db = db['data']
    # loop to get images 1 by 1
    for icnt in range(20670) :
        image_arr = training_db[icnt,:,:,:]
        # then do something with the image
You could also read multiple images by setting the first index to a range (say icnt:icnt+100) then handle looping appropriately.
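As a sketch of that batched loop: the helper below is hypothetical (the name iter_batches and the batch size of 100 are illustrative choices). It only relies on len() and slicing along the first axis, which an open h5py dataset supports, so a plain list stands in for the dataset here to keep the example runnable without the real file.

```python
def iter_batches(dataset, batch_size=100):
    """Yield (start_index, batch) pairs, slicing at most batch_size items at a time."""
    n_items = len(dataset)
    for start in range(0, n_items, batch_size):
        stop = min(start + batch_size, n_items)
        # With an h5py dataset, this slice reads only these rows from disk.
        yield start, dataset[start:stop]

# Stand-in for an open h5py dataset with 250 images.
demo = list(range(250))
sizes = [len(batch) for _, batch in iter_batches(demo, batch_size=100)]
print(sizes)
```

With the real file, the loop would sit inside the with h5py.File(...) block and be called as iter_batches(db['data'], 100); each batch would then be a NumPy array holding up to 100 images.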
Your problem arises because you are running out of memory. Virtual Datasets come in handy when dealing with large datasets like yours. They allow a number of real datasets to be mapped together into a single, sliceable dataset via an interface layer. You can read more about them here: https://docs.h5py.org/en/stable/vds.html
I would recommend starting with one file at a time. First, create a Virtual Dataset file from your existing data, like this:
import os
import numpy as np
import h5py
with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    data_shape = db['data'].shape
    layout = h5py.VirtualLayout(shape = (data_shape), dtype = np.uint8)
    vsource = h5py.VirtualSource(db['data'])
    # Map the whole source dataset into the layout; without a mapping the virtual dataset only returns the fill value
    layout[:] = vsource
with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'w', libver = 'latest') as file:
    file.create_virtual_dataset('data', layout = layout, fillvalue = 0)
This will create a virtual dataset of your existing training data. Now, if you want to manipulate your data, you should open your file in r+ mode like
with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'r+', libver = 'latest') as file:
    # Do whatever manipulation you want to do here
    pass
One more piece of advice: make sure the indices you use for slicing are of int datatype, otherwise you will get an error.

Reading the last batch of data added to hdfs file using Python

I have a program that will add a variable number of rows of data to an hdf5 file as shown below.
data_without_cosmic.to_hdf(new_file,key='s', append=True, mode='r+', format='table')
new_file is the file name and data_without_cosmic is a pandas data frame with 'x', 'y', 'z', and 'i' columns representing positional data and a scalar quantity. I may add several data frames of this form to the file each time I run the full program. Within each data frame I add, the 'z' value is constant.
The next time I use the program, I would need to access the last batch of rows that was added to the data in order to perform some operations. I wondered if there was a fast way to retrieve just the last data frame that was added to the file or if I could group the data in some way as I add it in order to be able to do so.
The only other way I can think of achieving my goal is by reading the entire file and then checking the z values from bottom up until it changes, but that seemed a little excessive. Any ideas?
P.S I am very inexperienced with working with hdf5 files but I read that they are efficient to work with.

How to copy a partial or skeleton h5py file

I have a few questions wrapped up into this issue. I realize this might be a convoluted post and can provide extra details.
A code package I use can produce large .h5 files (source.h5) (100+ Gb), where almost all of this data resides in 1 dataset (group2/D). I want to make a new .h5 file (dest.h5) using Python that contains all datasets except group2/D of source.h5 without needing to copy the entire file. I then will condense group2/D after some postprocessing and write a new group2/D in dest.h5 with much less data. However, I need to keep source.h5 because this postprocessing may need to be performed multiple times into multiple destination files.
source.h5 is always structured the same and cannot be changed in either source.h5 or dest.h5, where each letter is a dataset:
group1/A
group1/B
group2/C
group2/D
I thus want to initially make a file with this format:
group1/A
group1/B
group2/C
and again, fill in group2/D later. Simply copying source.h5 multiple times is always possible, but I'd like to avoid having to copy a huge file a bunch of times because disk space is limited and this is something that isn't a 1 off case.
I searched and found this question (How to partially copy using python an Hdf5 file into a new one keeping the same structure?) and tested if dest.h5 would be the same as source.h5:
fs = h5py.File('source.h5', 'r')
fd = h5py.File('dest.h5', 'w')
fs.copy('group1', fd)
fd.create_group('group2')
fs.copy('group2/C', fd['/group2'])
fd.copy('group2/D', fd['/group2'])
fs.close()
fd.close()
but the code package I used couldn't read the file I created (which I must have happen), implying there was some critical data loss when I did this operation (the file sizes differ by 7 kb also). I'm assuming the problem was when I created group2 manually because I checked with numpy that the values in group1 datasets exactly matched in both source.h5 and dest.h5. Before I did any digging into what data is missing I wanted to get a few things out of the way:
Question 1: Is there .h5 file metadata that accompanies each group or dataset? If so, how can I see it so I can create a group2 in dest.h5 that exactly matches the one in source.h5? Is there a way to see if 2 groups (not datasets) exactly match each other?
Question 2: Alternatively, is it possible to simply copy the data structure of a .h5 file (i.e. groups and datasets with empty lists as a skeleton file) so that fields can be populated later? Or, as a subset of this question, is there a way to copy a blank dataset to another file such that any metadata is retained (assuming there is some)?
Question 3: Finally, to avoid all this, is it possible to just copy a subset of source.h5 to dest.h5? With something like:
fs.copy(['group1','group2/C'], fd)
Thanks for your time. I appreciate you reading this far

Using MFDataset to combine netcdf files in python

I am trying to combine netcdf files, but it continuously shows
" File "CBL_plot.py", line 11, in f = MFDataset(fili) File "utils.pyx", line 274, in netCDF4.MFDataset.init (netCDF4.c:3822) IOError: master dataset THref_11:00.nc does not have a aggregation dimension."
So, I checked just one of the netcdf files, and its information is as below:
float64 th_ref(u't',)
unlimited dimensions = ()
current size = (30,)
It looks like there is no aggregation dimension. However, I would like to combine those netcdf files rather than using them one by one.
Is there any way to create aggregation dimension to make this MFData set work?
Below is the python code I used:
import numpy as np
from netCDF4 import MFDataset
varn = 'th_ref'
fili = 'THref_*nc'
f = MFDataset(fili)
Th = f.variables[varn]
Th_ref=np.array(Th[:])
print(Th.shape)
I will really appreciate any help, idea, and hint.
Thank you,
Isaac
Short answer: MFDataset can only aggregate along the slowest varying dimension in your files.
Longer answer: In the netcdf4-python documentation of MFDataset it says "Open a Dataset spanning multiple files, making it look as if it was a single file. Variables in the list of files that share the same dimension (specified with the keyword aggdim) are aggregated. If aggdim is not specified, the unlimited is aggregated. Currently, aggdim must be the leftmost (slowest varying) dimension of each of the variables to be aggregated."
So MFDataset works by aggregating along the slowest varying dimension in the existing files. So if you have a bunch of files that are snapshots of the same logical dataset at different times, and you want to aggregate in time, you need to have a time dimension in each of the files. If the time of the data is simply encoded in the file name, there is currently no way to use MFDataset to aggregate.
