h5py: how to use keys() to loop over HDF5 Groups and Datasets - python

print(list(file.keys()))
When I run this code I get:
T00000000,T00000001,T00000002,T00000003, ... ,T00000474
Now, I have analyzed T00000000, but I want to scan them all with a for loop. I couldn't do that because each entry is a string. Is there any way to do this?

#python_student, there is more to this than explained in the initial answer. Based on the syntax of your question, it appears you are using h5py to read the HDF5 file. To effectively access the file contents, you need a basic understanding of HDF5 and h5py. I suggest starting here: h5py Quick Start Guide. In addition, there are many good questions and answers here on StackOverflow with details and examples.
An HDF5 file has 2 basic objects:
Datasets: array-like collections of data
Groups: folder-like containers that hold datasets and other groups
h5py uses dictionary syntax to access Group objects and reads Datasets using NumPy syntax. (Note: Group objects are not Python dictionaries - they just "look" like them!)
As you noted, the keys() are the NAMES of the objects (groups or datasets) at the root level of your file. Your code created a list from the group keys: list(file.keys()). In general there is no reason to do this. Typically, you will iterate over the keys() or items() instead of creating a list.
Here is a short code segment to show how you might do this. I can add more details once I know more about your data schema. (HDF5 is a general data container and can have almost any schema.)
# loop on names:
for name in file.keys():
    print(name)

# loop on names and H5 objects:
for name, h5obj in file.items():
    if isinstance(h5obj, h5py.Group):
        print(name, 'is a Group')
    elif isinstance(h5obj, h5py.Dataset):
        print(name, 'is a Dataset')
        # return a np.array using dataset object:
        arr1 = h5obj[:]
        # return a np.array using dataset name:
        arr2 = file[name][:]
        # compare arr1 to arr2 (should always return True):
        print(np.array_equal(arr1, arr2))
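For example, if every key at the root level refers to a dataset (which the T00000000 ... names in your output suggest, but I can't verify), you could read each one into a NumPy array with a minimal sketch like this ('your_file.h5' is a placeholder filename):
import h5py
import numpy as np

with h5py.File('your_file.h5', 'r') as file:  # placeholder filename
    for name in file.keys():
        arr = file[name][:]  # read the whole dataset into a NumPy array
        print(name, arr.shape, arr.dtype)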

Yes, you can use the split() method.
If the string is "T00000000,T00000001,T00000002,T00000003, ... ,T00000474", you can use split to turn it into a list like this:
string = "T00000000,T00000001,T00000002,T00000003, ... ,T00000474"
values = string.split(",")
So, the list values becomes ["T00000000", "T00000001", "T00000002", "T00000003", ... , "T00000474"].
Then you can use this in a for loop.
If you don't want to create a list, you can simply do:
for value in string.split(","):
    # Your code here...
The for loop will be executed with the values T00000000, T00000001, T00000002, ...

Related

Dynamically generate array elements from yaml file in Python

Given the following yaml file stored in my_yaml that contains varying sets of dictionary keys and/or class variables (denoted by self._*):
config1.json:
- [[foo, bar], [hello, world]]
config2.json:
- [[foo], [self._hi]]
From the json file, I want to populate a new list of tuples. The items in each tuple are determined by looking up dict keys in this yaml file.
So if I iterate through a dictionary called config1.json, and I have an empty list called config_list, I want to do something like:
config_list.append(tuple([i['foo']['bar'], i['hello']['world']]))
But if it were config2.json, I want to do something like:
config_list.append(tuple([i['foo'], self._hi]))
I can do this in a less dynamic way:
for i in my_yaml['config1.json'][0]:
    config_list.append(tuple([i[my_yaml[asset][0][0]][my_yaml[asset][0][1]], i[my_yaml[asset][1][0]][my_yaml[asset][1][1]]]))
or:
for i in my_yaml['config2.json'][0]:
    config_list.append(tuple([i[my_yaml[asset][0][0]], i[my_yaml[asset][1][0]]]))
Instead I would like to dynamically generate the contents of config_list
Any ideas or alternatives would be greatly appreciated.
I think you are confusing things a bit. First of all, you refer to a file in "From the json [sic] file", and there is no JSON file mentioned anywhere in the question. There are mapping keys that look like filenames for JSON files, so I hope we can assume you mean "From the value associated with the mapping key that ends in the string .json".
The other confusing thing is that you obfuscate the fact that you want tuples, but load lists nested in lists nested in lists from your YAML document.
If you want tuples, it is much more clear to specify them in your YAML document:
config1.json:
- !!python/tuple [[foo, bar], [hello, world]]
config2.json:
- !!python/tuple [[foo], [self._hi]]
So you can do:
import sys
import ruamel.yaml

yaml = ruamel.yaml.YAML(typ='unsafe')
with open('my.yaml') as fp:
    my_yaml = yaml.load(fp)
for key in my_yaml:
    for idx, elem in enumerate(my_yaml[key]):
        print('{}[{}] -> {}'.format(key, idx, my_yaml[key][idx]))
which directly gives you the tuples you seem to want instead of lists you need to process:
config1.json[0] -> (['foo', 'bar'], ['hello', 'world'])
config2.json[0] -> (['foo'], ['self._hi'])
In your question you hard-code access to the first and only element of the sequences that are the values for the root-level mapping. This forces you to use the final [0] in your for loop. I assume you are going to have multiple elements in those sequences, but for a good question you should leave that out, as it is irrelevant to the question of how to get the tuples and thereby only obfuscates things.
Please note that you need to keep control over your input, as using typ='unsafe' is, you guessed it, unsafe. If you cannot guarantee that, use typ='safe' and register and use the tag !tuple.
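For reference, here is a minimal, untested sketch of how that safe-mode registration could look; the construct_tuple helper and the plain !tuple tag in the YAML document are my own additions, not something from the question:
import ruamel.yaml

def construct_tuple(constructor, node):
    # build the sequence eagerly and turn it into a tuple
    return tuple(constructor.construct_sequence(node, deep=True))

yaml = ruamel.yaml.YAML(typ='safe')
yaml.constructor.add_constructor('!tuple', construct_tuple)

with open('my.yaml') as fp:   # my.yaml would now use "!tuple" instead of "!!python/tuple"
    my_yaml = yaml.load(fp)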

Preallocate very large array in Python leads to MemoryError

I am trying to preallocate a list in Python:
c = [1] * mM  # preallocate array
My problem is that I run into a MemoryError since
mM = 4999999950000000
What is the best way to deal with this? I am thinking about creating a new object where I split my list at about a value of 500000000.
Is this what I should do or is there a best practice to create an array with a lot of inputs?
Using a Generator
You are attempting to create an object that you very likely will not be able to fit into your computer's memory. If you truly need to represent a list of that length, you can use a generator that dynamically produces values as they are needed.
def ones_generator(length):
    for _ in range(length):
        yield 1

gen = ones_generator(4999999950000000)
for i in gen:
    print(i)  # prints 1, a lot
Note: The question is tagged for Python 3, but if you are using Python 2.7, you will want to use xrange instead of range.
Using a Dictionary
By the sound of your question, you do not actually need to preallocate a list of that length, but you want to store values very sparsely at indexes that are very large. This pattern matches the dict type in Python more so than the list. You can simply store values in a dictionary, without pre-allocating the keys/space; Python handles that under the hood for you.
dct = {}
dct[100000] = "A string"
dct[592091] = 123
dct[4999999950000000] = "I promise, I need to be at this index"
print(dct[4999999950000000])
# I promise, I need to be at this index
In that example, I just stored str and int values, but they can be any object in Python. The best part about this is that this dictionary will not consume memory based on the maximum index (like a list would) but instead based on how many values are stored within it.
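To make the memory argument concrete, here is a rough illustration (the exact byte counts are CPython-specific and approximate):
import sys

dct = {100000: "A string", 592091: 123, 4999999950000000: "I promise, I need to be at this index"}
print(sys.getsizeof(dct))  # a few hundred bytes, regardless of how large the keys are

# A list that reaches index 4999999950000000 needs one pointer slot per index:
# 4999999950000000 * 8 bytes is roughly 40 petabytes, hence the MemoryError.
print(4999999950000000 * 8)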

How to list all datasets in h5py file?

I have an h5py file storing numpy arrays, but I got an Object doesn't exist error when trying to open it with the dataset name I remember, so is there a way I can list what datasets the file has?
with h5py.File('result.h5', 'r') as hf:
    # How can I list all datasets I have saved in hf?
You have to use the keys method. This will give you a list of unicode strings of your dataset and group names.
For example:
Datasetnames = hf.keys()
Another GUI-based method would be to use HDFView.
https://support.hdfgroup.org/products/java/release/download.html
The other answers just tell you how to make a list of the keys under the root group, which may refer to other groups or datasets.
If you want something closer to h5dump, but in Python, you can do something like this:
import h5py

def descend_obj(obj, sep='\t'):
    """
    Iterate through groups in an HDF5 file and print the group and dataset names and dataset attributes
    """
    if type(obj) in [h5py._hl.group.Group, h5py._hl.files.File]:
        for key in obj.keys():
            print(sep, '-', key, ':', obj[key])
            descend_obj(obj[key], sep=sep+'\t')
    elif type(obj) == h5py._hl.dataset.Dataset:
        for key in obj.attrs.keys():
            print(sep+'\t', '-', key, ':', obj.attrs[key])

def h5dump(path, group='/'):
    """
    Print HDF5 file metadata.
    group: you can give a specific group, defaults to the root group
    """
    with h5py.File(path, 'r') as f:
        descend_obj(f[group])
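Usage, assuming the result.h5 file from the original question, would then simply be:
h5dump('result.h5')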
If you want to list the key names, you need to use the keys() method, which gives you a key object; then use list() to turn it into a list of keys:
with h5py.File('result.h5', 'r') as hf:
    dataset_names = list(hf.keys())
If you are at the command line, use h5ls -r [file] or h5dump -n [file] as recommended by others.
Within python, if you want to list below the topmost group but you don't want to write your own code to descend the tree, try the visit() function:
with h5py.File('result.h5', 'r') as hf:
    hf.visit(print)
Or for something more advanced (e.g. to include attributes info) use visititems:
def printall(name, obj):
    print(name, dict(obj.attrs))

with h5py.File('result.h5', 'r') as hf:
    hf.visititems(printall)
Since using the keys() function will give you only the top level keys and will also contain group names as well as datasets (as already pointed out by Seb), you should use the visit() function (as suggested by jasondet) and keep only keys that point to datasets.
This answer is kind of a merge of jasondet's and Seb's answers to a simple function that does the trick:
def get_dataset_keys(f):
    keys = []
    f.visit(lambda key: keys.append(key) if isinstance(f[key], h5py.Dataset) else None)
    return keys
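Usage would look like this (again assuming the result.h5 file from the question):
with h5py.File('result.h5', 'r') as hf:
    print(get_dataset_keys(hf))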
Just for showing the names of the underlying datasets, I would simply use h5dump -n <filename>
That is without running a python script.

Mutability of Python Generator Expressions versus List and Dictionary Comprehension: Nested Dictionary Weirdness

I am using Python 3.5 to create a set of generators to parse a set of opened files in order to cherry-pick data from those files to construct an object I plan to export later. I was originally parsing through the entirety of each file and creating a list of dictionary objects before doing any analysis, but this process would sometimes take up to 30 seconds, and since I only need to work with each line of each file once, I figure it's a great opportunity to use a generator. However, I feel that I am missing something conceptually with generators, and perhaps the mutability of objects within a generator.
My original code that makes a list of dictionaries goes as follows:
parsers = {}
# iterate over files in the file_name file to get their attributes
for dataset, data_file in files.items():
    # Store each dataset as a list of dictionaries with keys that
    # correspond to the attributes of that dataset
    parsers[dataset] = [{attributes[dataset][i]: value.strip('~')
                         for i, value in enumerate(line.strip().split('^'))}
                        for line in data_file]
And I access the list by calling:
>>> parsers['definitions']
And it works as expected returning a list of dictionaries. However when I convert this list into a generator, all sorts of weirdness happens.
parsers = {}
# iterate over files in the file_name file to get their attributes
for dataset, data_file in files.items():
    # Store each dataset as a list of dictionaries with keys that
    # correspond to the attributes of that dataset
    parsers[dataset] = ({attributes[dataset][i]: value.strip('~')
                         for i, value in enumerate(line.strip().split('^'))}
                        for line in data_file)
And I call it by using:
>>> next(parsers['definitions'])
Running this code returns an index out of range error.
The main difference I can see between the two code segments is that in the list comprehension version, python constructs the list from the file and moves on without needing to store the comprehensions variables for later use.
Conversely, in the generator expression the variables defined within the generator need to be stored with the generator, as they affect each successive call of the generator later in my code. I am thinking that perhaps the variables inside the generator are sharing a namespace with the other generators my code creates, so each generator has erratic behavior based on whichever generator expression was run last and therefore set the variables' values last.
I appreciate any thoughts as to the reason for this issue!
I assume that the problem is when you're building the dictionaries.
attributes[dataset][i]
Note that with the list version, dataset is whatever dataset was at that particular turn of the for loop. However, with the generator, that expression isn't evaluated until after the for loop has completed, so dataset will have the value of the last dataset from the files.items() loop...
Here's a super simple demo that hopefully elaborates on the problem:
results = []
for a in [1, 2, 3]:
    results.append(a for _ in range(3))

for r in results:
    print(list(r))
Note that we always get [3, 3, 3] because when we take the values from the generator, the value of a is 3.
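One common fix, sketched below with a hypothetical make_parser helper, is to pass the per-dataset values into a function so each generator captures its own copy instead of the shared loop variable:
def make_parser(data_file, attrs):
    # attrs is bound as a parameter, so this generator keeps its own copy
    return ({attrs[i]: value.strip('~')
             for i, value in enumerate(line.strip().split('^'))}
            for line in data_file)

parsers = {}
for dataset, data_file in files.items():
    # attributes[dataset] is evaluated now, at each turn of the loop
    parsers[dataset] = make_parser(data_file, attributes[dataset])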

Accessing Data from .mat (version 8.1) structure in Python

I have a Matlab (.mat, version >7.3) file that contains a structure (data) that itself contains many fields. Each field is a single column array. Each field represents an individual sensor and the array is the time series data. I am trying to open this file in Python to do some more analysis. I am using PyTables to read the data in:
import tables
impdat = tables.openFile('data_file.mat')
This reads the file in and I can enter the fileObject and get the names of each field by using:
impdat.root.data.__members__
This prints a list of the fields:
['rdg', 'freqlabels', 'freqbinsctr',... ]
Now, what I would like is a method to take each field in data and make a python variable (perhaps dictionary) with the field name as the key (if it is a dictionary) and the corresponding array as its value. I can see the size of the array by doing, for example:
impdat.root.data.rdg
which returns this:
/data/rdg (EArray(1, 1286920), zlib(3))
atom := Int32Atom(shape=(), dflt=0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := (1, 16290)
My question is how do I access some of the data stored in that large array (1, 1286920). How can I read that array into another Python variable (list, dictionary, numpy array, etc.)? Any thoughts or guidance would be appreciated.
I have come up with a working solution. It is not very elegant, as it requires an eval. So I first create a new variable (alldata) pointing to the data I want to access, then I create an empty dictionary datastruct, and then I loop over all the members of data and assign the arrays to the appropriate key in the dictionary:
alldata = impdat.root.data
datastruct = {}
for names in impdat.root.data.__members__:
    datastruct[names] = eval('alldata.' + names + '[0][:]')
The '[0]' could be superfluous depending on the structure of the data you are trying to access. In my case the data is stored in an array of an array and I just want the first one. If you come up with a better solution please feel free to share.
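One possible refinement, if I am reading PyTables' natural-naming access correctly, is to use getattr instead of eval; this is an untested sketch along the same lines as the code above:
alldata = impdat.root.data
datastruct = {}
for name in alldata.__members__:
    node = getattr(alldata, name)   # the same node that eval('alldata.' + name) refers to
    datastruct[name] = node[0][:]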
I can't seem to replicate your code. I get an error when trying to open the file which I made in 8.0 using tables.
How about taking the variables within the structure and saving them to a new mat file which only contains a collection of variables? This would make it much easier to deal with, and it has already been answered quite eloquently here.
That answer states that these mat files are simply HDF5 files, which can be read with:
import numpy as np, h5py
f = h5py.File('somefile.mat','r')
data = f.get('data/variable1')
data = np.array(data) # For converting to numpy array
Not sure the size of the data set you're working with. If it's large I'm sure I could come up with a script to pull the fields out of the structures. I did find this tool which may be helpful. It recursively gets all of the structure field names.
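Building on that h5py snippet, here is a minimal sketch of the dictionary the original question asked for, assuming the struct is stored as a group named data (as the /data/rdg path in the question suggests):
import numpy as np, h5py

with h5py.File('data_file.mat', 'r') as f:
    data_group = f['data']
    # one NumPy array per field of the struct, keyed by field name
    datastruct = {name: np.array(data_group[name]) for name in data_group.keys()}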
