Read multiple datasets from same Group in h5 file using h5py - python

I have several groups in my h5 file: 'group1', 'group2', ... and each group has 3 different datasets: 'dataset1', 'dataset2', 'dataset3', all of which are arrays with numerical values, but the array sizes differ.
My goal is to save each dataset from group to a numpy array.
Example:
import h5py
filename = '../Results/someFileName.h5'
data = h5py.File(filename, 'r')
Now I can easily iterate over all groups with
for i in range(len(data.keys())):
    group = list(data.keys())[i]
but I can't figure out how to access the datasets within the group. So I am looking for something like MATLAB:
hinfo = h5info(filename);
for i = 1:length(hinfo.Groups)
    datasetname = [hinfo.Groups(i).Name '/dataset1'];
    dset = h5read(filename, datasetname);
end
Where dset is now an array of numbers.
Is there a way I could do the same with h5py?

You have the right idea.
But, you don't need to loop on range(len(data.keys())).
Just use data.keys(); it gives you an iterable view of the object names.
Try this:
import h5py
filename = '../Results/someFileName.h5'
data = h5py.File(filename, 'r')
for group in data.keys():
    print(group)
    for dset in data[group].keys():
        print(dset)
        ds_data = data[group][dset]  # returns HDF5 dataset object
        print(ds_data)
        print(ds_data.shape, ds_data.dtype)
        arr = data[group][dset][:]   # adding [:] returns a numpy array
        print(arr.shape, arr.dtype)
        print(arr)
Note: the logic above is valid ONLY when there are only groups at the top level (no datasets); it does not test whether each object is a group or a dataset.
To avoid these assumptions/limitations, you should investigate .visititems() or write a generator to recursively visit objects. The first 2 linked answers below are examples showing .visititems() usage, and the last one uses a generator function (a minimal .visititems() sketch follows the list):
Use visititems(-function-) to loop recursively
This example uses isinstance() as the test. The object is a Group when it tests true for h5py.Group and a Dataset when it tests true for h5py.Dataset. I consider this more Pythonic than the second example below (IMHO).
Convert hdf5 to raw organised in folders
It checks the number of objects below the visited object: when there are no subgroups, it is a dataset, and when there are subgroups, it is a group.
How can I combine multiple .h5 files? This question has multiple answers; the linked one uses a generator to merge data from several files with several groups and datasets into a single file.
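For reference, here is a minimal sketch of the .visititems() approach combined with the isinstance() test (the filename is the one from the question above):
import h5py

def visitor(name, obj):
    # called once for every object in the file; "name" is the full HDF5 path
    if isinstance(obj, h5py.Dataset):
        print('Dataset:', name, obj.shape, obj.dtype)
    elif isinstance(obj, h5py.Group):
        print('Group:', name)

with h5py.File('../Results/someFileName.h5', 'r') as data:
    data.visititems(visitor)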

This method requires that the dataset names (in the question 'dataset1', 'dataset2', 'dataset3'; in the example below 'lat', 'lon', 'x', 'y') be the same in each of the HDF5 groups of one HDF5 file.
import h5py
import numpy as np

# create empty lists
lat = []
lon = []
x = []
y = []

# fill lists, building numpy arrays
h5f = h5py.File('filename.h5', 'r')               # open file for reading
for group in h5f.keys():                          # iterate through groups
    lat = np.append(lat, h5f[group]['lat'][()])   # append data from each group's datasets
    lon = np.append(lon, h5f[group]['lon'][()])
    x = np.append(x, h5f[group]['x'][()])
    y = np.append(y, h5f[group]['y'][()])
h5f.close()

Related

How can I save my results in the same file as different columns in case of a 'for-cycle'

def get_df():
    df = pd.DataFrame()
    os.chdir("C:/Users/s/Desktop/P")
    for file in os.listdir():
        if file.endswith('.csv'):
            av_a = np.average(a, axis=0)
            np.savetxt('merged_average.csv', av_a, delimiter=',')
I've tried to save it but it always overwrites with the next file and deletes the previous results
At the moment, your code is a bit hard to read, as you are declaring variables which are not used (df) and using variables which are not declared (a). In the future, try to give a minimal reproducible example of your problematic code.
I'll still try to give you an interpreted answer:
If you want to store multiple columns from different files next to each other, the job becomes simpler by first acquiring all of the columns, and then afterwards saving them to the file in a single action.
Here is an interpretation of your code:
def get_df():
    # create an empty list to collect all results
    average_results = []
    os.chdir("C:/Users/s/Desktop/P")
    for file in os.listdir():
        if file.endswith('.csv'):
            a = something(file)  # unknown to me
            average_results.append(np.average(a, axis=0))
    # convert the results to a 2d numpy matrix,
    # optionally transpose it to get the desired data orientation
    data = np.array(average_results).transpose()
    # save the full dataset
    np.savetxt('merged_average.csv', data, delimiter=',')

How can I open large (2GB) files and process them more quickly?

This is the code I am currently using to open this very large matlab file:
self.data = {}
f = h5py.File(filepath, 'r')
for k, v in f.items():
    self.data[k] = np.array(v)
self.data = list(self.data.items())
self.data = np.array(self.data)
self.fs = self.data[1][1][0][0]
self.data = self.data[0][1]
print('fs = ', self.fs)
print('DONE. Data read using h5py reader')
It takes about 3-5 minutes to load fully. How can I improve this code so I can speed up the process?
Saving and using objects is easy. It's hard to provide specifics about objects vs arrays without knowing the purpose of your code. What do you want to do with variable self.data after you get it? You retrieve all of the datasets, then keep redefining self.data. At the end, you assign self.data to an array for the 1st dataset. Also, self.fs points to a single value from the 2nd dataset. So, it's not clear why you retrieve all of them.
If you only want the data from the first dataset, something like this will do the job.
f = h5py.File(filepath, 'r')
# returns first key from group member names:
dsname = next(iter(f)) # (name of 1st dataset)
self.data = f[dsname][:] # (data from 1st dataset)
If you want the name and data from all the datasets, this will return a dictionary of names and objects.
f = h5py.File(filepath, 'r')
self.data = {}               # start with an empty dictionary
for k, v in f.items():
    self.data[k] = v         # stores the h5py dataset object; no data is read yet
Then, when you need an array of data, use one of these methods:
arr = data['a dataset name'][:]   # for 1 dataset (when you know the name)

for dsname, obj in data.items():  # loop thru all pairs in data
    arr = obj[:]                  # or
    arr = f[dsname][:]
To understand what's going on, let's start with some HDF5 basics. It's critically important to understand the schema when working with HDF5. The data structure is similar to folders and files on a computer: the folders are "Groups" and the files are "Datasets". With h5py, you access Group members using dictionary syntax, and you access Dataset data with NumPy array syntax. (The h5py File object behaves like a Group.)
So, when you execute for k, v in f.items(), h5py returns key/value pairs for each group member (k, v in your code). The key is the object's name and the value is an h5py object (either a dataset or another group).
Once you have a dataset, there are 3 ways to access the associated data:
You can simply save the object and reference it "as-if" it is a NumPy array.
You can save the object's name, then when you need the object's data, retrieve it using the dataset name and the associated group.
You can read the dataset values into a NumPy array. (This is what your code does.)
Note: I prefer Methods #1 and #2 because they do not load the data into memory until you read the data.
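As a rough sketch of Methods #1 and #2 (the name 'a dataset name' below is just a placeholder, not a real dataset name):
import h5py

f = h5py.File(filepath, 'r')

# Method 1: save the dataset object; nothing is read from disk yet
ds_obj = f['a dataset name']   # placeholder name; use a real dataset name from your file
arr1 = ds_obj[:]               # the data is only read here

# Method 2: save the dataset's name; read the data later from the open file
ds_name = 'a dataset name'
arr2 = f[ds_name][:]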
Here is your code with my comments about each step:
# returns a file object "f" (behaves like a group):
f = h5py.File(filepath, 'r')
# returns key/value pairs of group member names/objects:
for k, v in f.items():
    # adds a dictionary key (object name) with the value of an array read from object "v":
    self.data[k] = np.array(v)
# converts the (name, array) dictionary into a list:
self.data = list(self.data.items())
# converts the list into an ndarray of dtype=object:
# - has 1 row for each dataset with 2 columns:
# - column 0: object name, column 1: array of data
self.data = np.array(self.data)
# reads the [0][0] value from the array at self.data[1][1] ([0][0] value of the 2nd dataset)
self.fs = self.data[1][1][0][0]
# resets data to the array at self.data[0][1] (the array for the 1st dataset)
self.data = self.data[0][1]

Problem either with number of characters exceeding cell limit, or storing lists of variable length

The problem:
I have lists of genes expressed in 53 different tissues. Originally, this data was stored in a maximal array of the genes, with 'NaN' where there was no expression. I am trying to create new lists for each tissue that just have the genes expressed, as it was very inefficient to be searching through this array every time I was running my script. I have a code that finds the genes for each tissue as required, but I do not know how to store the output.
I was using a pandas data frame, and then converting to csv. But this does not accept lists of varying length, unless I put the list as a single item. However, when I then save the data frame to a csv, it tries to squeeze this very long list (all genes expressed for one tissue) into a single cell. I get an error of the string length exceeding the Excel character-per-cell limit.
Therefore I need a way of either dealing with this limit, or storing my lists in a different way. I would rather just have one file for all lists.
My code:
import csv
import pandas as pd
import math
import numpy as np

#Import list of tissues:
df = pd.read_csv(r'E-MTAB-5214-query-results.tsv', skiprows=[0,1,2,3], sep='\t')
tissuedict = df.to_dict()
tissuelist = list(tissuedict.keys())[2:]
all_genes = [gene for key, gene in tissuedict['Gene Name'].items()]

data = []
for tissue in tissuelist:
    #Create array to keep track of the protein mRNAs in tissue that are not present in the network
    #initiate with first tissue, protein
    nanInd = [key for key, value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    datatis = [tissue, tissueExpression.tolist()]
    print(datatis)
    data.append(datatis)
print(data)

df = pd.DataFrame(data)
df.to_csv(r'tissue_expression_data.csv')
Link to data (either one):
https://github.com/joanna-lada/gene_data/blob/master/E-MTAB-5214-query-results.tsv
https://raw.githubusercontent.com/joanna-lada/gene_data/master/E-MTAB-5214-query-results.tsv
IIUC you need lists of the gene names found in each tissue. This writes these lists as columns into a csv:
import pandas as pd
df = pd.read_csv('E-MTAB-5214-query-results.tsv', skiprows = [0,1,2,3], sep='\t')
df = df.drop(columns='Gene ID').set_index('Gene Name')
res = pd.DataFrame()
for c in df.columns:
    res = pd.concat([res, pd.Series(df[c].dropna().index, name=c)], axis=1)
res.to_csv('E-MTAB-5214-query-results.csv', index=False)
(Writing them as rows would have been easier, but Excel can't import so many columns)
Don't open the csv in Excel directly, but use a blank worksheet and import the csv (Data - External data, From text), otherwise you can't separate them into Excel columns in one run (at least in Excel 2010).
Create your data variable as a dictionary; you can then save the dictionary to a JSON file using json.dump:
import json

data = {}
for tissue in tissuelist:
    nanInd = [key for key, value in tissuedict[tissue].items() if math.isnan(value)]
    tissueExpression = np.delete(all_genes, nanInd)
    data[tissue] = tissueExpression.tolist()

with open('filename.json', 'w') as fp:
    json.dump(data, fp)

load matlab tables in python using scipy.io.loadmat

Is it possible to load matlab tables in python using scipy.io.loadmat?
What I'm doing:
In Matlab:
tab = table((1:500)')
save('tab.mat', 'tab')
In Python:
import scipy.io
mat = scipy.io.loadmat('m:/tab.mat')
But I cannot access the table tab in Python using mat['tab']
The answer to your question is no. Many MATLAB objects can be loaded in Python; tables, among others, cannot be loaded. See Handle Data Returned from MATLAB to Python.
The loadmat function doesn't load MATLAB tables. Instead, a small workaround can be used: the tables can be saved as .csv files, which can then be read using pandas.
In MATLAB
writetable(table_name, file_name)
In Python
df = pd.read_csv(file_name)
At the end, the DataFrame df will have the contents of table_name
I've looked into this for a project I'm working on, and as a workaround, you could try the following.
In MATLAB, first convert the table object into a struct, and retrieve the column names using:
table_struct = struct(table_object);
table_columns = table_struct.varDim.labels;
save table_as_struct table_struct table_columns;
And then you can try the following code in python:
import numpy
import pandas as pd
import scipy.io

# function to load a table variable from a MAT-file
def loadtablefrommat(matfilename, tablevarname, columnnamesvarname):
    """
    Read a struct-ified table variable (and column names) from a MAT-file
    and return a pandas.DataFrame object.
    """
    # load file
    mat = scipy.io.loadmat(matfilename)

    # get table (struct) variable
    tvar = mat.get(tablevarname)
    data_desc = mat.get(columnnamesvarname)
    types = tvar.dtype
    fieldnames = types.names

    # extract data (from table struct)
    data = None
    for idx in range(len(fieldnames)):
        if fieldnames[idx] == 'data':
            data = tvar[0][0][idx]
            break

    # get number of columns and rows
    numcols = data.shape[1]
    numrows = data[0, 0].shape[0]

    # and get column headers as a list (array)
    data_cols = []
    for idx in range(numcols):
        data_cols.append(data_desc[0, idx][0])

    # create dict out of original table
    table_dict = {}
    for colidx in range(numcols):
        rowvals = []
        for rowidx in range(numrows):
            rowval = data[0, colidx][rowidx][0]
            if type(rowval) == numpy.ndarray and rowval.size > 0:
                rowvals.append(rowval[0])
            else:
                rowvals.append(rowval)
        table_dict[data_cols[colidx]] = rowvals
    return pd.DataFrame(table_dict)
Based on Jochen's answer, I propose a different variant that does a good job for me.
I wrote a MATLAB script to prepare the m-file automatically (see my GitLab repository with examples).
It does the following:
In Matlab for class table:
It does the same as Jochen's example, but binds the data together, so it is easier to load multiple variables. The names "table" and "columns" are mandatory for the next part.
YourVariableName = struct('table', struct(TableYouWantToLoad), 'columns', {struct(TableYouWantToLoad).varDim.labels})
save('YourFileName', 'YourVariableName')
In Matlab for class dataset:
An alternative, if you have to handle the old dataset type.
YourVariableName = struct('table', struct(DatasetYouWantToLoad), 'columns', {get(DatasetYouWantToLoad,'VarNames')})
save('YourFileName', 'YourVariableName')
In Python:
import scipy.io as sio
mdata = sio.loadmat('YourFileName')
mtable = load_table_from_struct(mdata['YourVariableName'])
with
import pandas as pd

def load_table_from_struct(table_structure) -> pd.DataFrame:
    # get prepared data structure
    data = table_structure[0, 0]['table']['data']
    # get prepared column names
    data_cols = [name[0] for name in table_structure[0, 0]['columns'][0]]

    # create dict out of original table
    table_dict = {}
    for colidx in range(len(data_cols)):
        table_dict[data_cols[colidx]] = [val[0] for val in data[0, 0][0, colidx]]
    return pd.DataFrame(table_dict)
It is independent of loading the file, but basically a minimized version of Jochen's code. So please give him kudos for his post.
As others have mentioned, this is currently not possible, because Matlab has not documented this file format. People are trying to reverse engineer the file format but this is a work in progress.
A workaround is to write the table to CSV format and to load that using Python. The entries in the table can be variable length arrays and these will be split across numbered columns. I have written a short function to load both scalars and arrays from this CSV file.
To write the table to CSV in matlab:
writetable(table_name, filename)
To read the CSV file in Python:
import pandas

def load_matlab_csv(filename):
    """Read a CSV written by MATLAB writetable into DataFrames.

    Each entry in the table can be a scalar or a variable length array.
    If it is a variable length array, then MATLAB generates a set of
    columns, long enough to hold the longest array. These columns have
    the variable name with an index appended.

    This function infers which entries are scalars and which are arrays.
    Arrays are grouped together and sorted by their index.

    Returns: scalar_df, array_df
        scalar_df : DataFrame of scalar values from the table
        array_df : DataFrame with MultiIndex on columns
            The first level is the array name
            The second level is the index within that array
    """
    # Read the CSV file
    tdf = pandas.read_table(filename, sep=',')
    cols = list(tdf.columns)

    # Figure out which columns correspond to scalars and which to arrays
    scalar_cols = []       # scalar column names
    arr_cols = []          # array column names, without index
    arrname2idxs = {}      # dict of array column name to list of integer indices
    arrname2colnames = {}  # dict of array column name to list of full names

    # Iterate over columns
    for col in cols:
        # If the name contains "_" and ends in digits, it's probably from an array
        if col[-1] in '0123456789' and '_' in col:
            # Array col
            # Infer the array name and index
            colsplit = col.split('_')
            arr_idx = int(colsplit[-1])
            arr_name = '_'.join(colsplit[:-1])

            # Store
            if arr_name in arrname2idxs:
                arrname2idxs[arr_name].append(arr_idx)
                arrname2colnames[arr_name].append(col)
            else:
                arrname2idxs[arr_name] = [arr_idx]
                arrname2colnames[arr_name] = [col]
                arr_cols.append(arr_name)
        else:
            # Scalar col
            scalar_cols.append(col)

    # Extract all scalar columns
    scalar_df = tdf[scalar_cols]

    # Extract each set of array columns into its own dataframe
    array_df_d = {}
    for arrname in arr_cols:
        adf = tdf[arrname2colnames[arrname]].copy()
        adf.columns = arrname2idxs[arrname]
        array_df_d[arrname] = adf

    # Concatenate array dataframes
    array_df = pandas.concat(array_df_d, axis=1)

    return scalar_df, array_df
scalar_df, array_df = load_matlab_csv(filename)

How to overwrite array inside h5 file using h5py

I'm trying to overwrite a numpy array that's a small part of a pretty complicated h5 file.
I'm extracting an array, changing some values, then want to re-insert the array into the h5 file.
I have no problem extracting the array that's nested.
f1 = h5py.File(file_name,'r')
X1 = f1['meas/frame1/data'].value
f1.close()
My attempted code looks something like this with no success:
f1 = h5py.File(file_name,'r+')
dset = f1.create_dataset('meas/frame1/data', data=X1)
f1.close()
As a sanity check, I executed this in Matlab using the following code, and it worked with no problems.
h5write(file1, '/meas/frame1/data', X1);
Does anyone have any suggestions on how to do this successfully?
You want to assign values, not create a dataset:
f1 = h5py.File(file_name, 'r+') # open the file
data = f1['meas/frame1/data'] # load the data
data[...] = X1 # assign new values to data
f1.close() # close the file
To confirm the changes were properly made and saved:
f1 = h5py.File(file_name, 'r')
np.allclose(f1['meas/frame1/data'][()], X1)
#True
askewchan's answer describes the way to do it (you cannot create a dataset under a name that already exists, but you can of course modify the dataset's data). Note, however, that the dataset must have the same shape as the data (X1) you are writing to it. If you want to replace the dataset with some other dataset of different shape, you first have to delete it:
del f1['meas/frame1/data']
dset = f1.create_dataset('meas/frame1/data', data=X1)
Different scenarios:
Partial changes to dataset
with h5py.File(file_name, 'r+') as ds:
    ds['meas/frame1/data'][5] = val     # change index 5 to scalar "val"
    ds['meas/frame1/data'][3:7] = vals  # change values of indices 3-6 to "vals"
Change each value of dataset (identical dataset sizes)
with h5py.File(file_name, 'r+') as ds:
    ds['meas/frame1/data'][...] = X1    # change array values to those of "X1"
Overwrite dataset to one of different size
with h5py.File(file_name, 'r+') as ds:
    del ds['meas/frame1/data']                       # delete old, differently sized dataset
    ds.create_dataset('meas/frame1/data', data=X1)   # implant new-shaped dataset "X1"
Since the File object is a context manager, using with statements is a nice way to package your code, since the file is closed automatically once you're done altering it. (You don't want to be in read/write mode if you only need to read off data!)
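For example, a minimal read-only sketch using the same dataset path:
import h5py

# open read-only when you only need to read off data
with h5py.File(file_name, 'r') as f:
    X1 = f['meas/frame1/data'][:]   # read the dataset into a numpy array
# the file is closed automatically when the with-block ends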
