How to list all datasets in h5py file? - python

I have an h5py file storing numpy arrays, but I got an "Object doesn't exist" error when trying to open it with the dataset name I remember, so is there a way I can list what datasets the file has?
with h5py.File('result.h5', 'r') as hf:
    # How can I list all datasets I have saved in hf?

You have to use the keys() method. This gives you the names of your datasets and groups at the root level. Note that under Python 3, keys() returns a view object rather than a plain list.
For example:
dataset_names = hf.keys()
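A fuller sketch in context (the file name comes from the question): the view returned by keys() is tied to the open file, so convert it with list() while the file is still open if you need the names afterwards:
import h5py

with h5py.File('result.h5', 'r') as hf:
    dataset_names = list(hf.keys())
print(dataset_names)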
Another GUI-based method would be to use HDFView.
https://support.hdfgroup.org/products/java/release/download.html

The other answers just tell you how to make a list of the keys under the root group, which may refer to other groups or datasets.
If you want something closer to h5dump, but in Python, you can do something like this:
import h5py

def descend_obj(obj, sep='\t'):
    """
    Iterate through groups in an HDF5 file and print the group and dataset
    names and dataset attributes.
    """
    if isinstance(obj, (h5py.Group, h5py.File)):
        for key in obj.keys():
            print(sep, '-', key, ':', obj[key])
            descend_obj(obj[key], sep=sep+'\t')
    elif isinstance(obj, h5py.Dataset):
        for key in obj.attrs.keys():
            print(sep+'\t', '-', key, ':', obj.attrs[key])

def h5dump(path, group='/'):
    """
    Print HDF5 file metadata.
    group: you can give a specific group, defaults to the root group.
    """
    with h5py.File(path, 'r') as f:
        descend_obj(f[group])
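For example, to dump the whole tree of the file from the question above (file name assumed from there):
h5dump('result.h5')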

If you want to list the key names, use the keys() method, which gives you a view of the keys, then pass it to the built-in list() to get an actual list:
with h5py.File('result.h5', 'r') as hf:
    dataset_names = list(hf.keys())

If you are at the command line, use h5ls -r [file] or h5dump -n [file] as recommended by others.
Within Python, if you want to list below the topmost group but you don't want to write your own code to descend the tree, try the visit() function:
with h5py.File('result.h5', 'r') as hf:
    hf.visit(print)
Or for something more advanced (e.g. to include attributes info) use visititems:
def printall(name, obj):
    print(name, dict(obj.attrs))

with h5py.File('result.h5', 'r') as hf:
    hf.visititems(printall)
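One more detail worth knowing about visit() (documented behavior in h5py): if the callable returns anything other than None, iteration stops and visit() returns that value. A minimal sketch using this to find the name of the first dataset in the file:
with h5py.File('result.h5', 'r') as hf:
    # returns the first member name whose object is a Dataset, or None
    first = hf.visit(lambda name: name if isinstance(hf[name], h5py.Dataset) else None)
    print(first)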

Since the keys() function gives you only the top-level keys, and includes group names as well as dataset names (as already pointed out by Seb), you should use the visit() function (as suggested by jasondet) and keep only the keys that point to datasets.
This answer is kind of a merge of jasondet's and Seb's answers to a simple function that does the trick:
def get_dataset_keys(f):
    keys = []
    f.visit(lambda key: keys.append(key) if isinstance(f[key], h5py.Dataset) else None)
    return keys
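A minimal usage sketch (file name assumed from the question above):
with h5py.File('result.h5', 'r') as hf:
    print(get_dataset_keys(hf))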

Just for showing the names of the underlying datasets, I would simply use h5dump -n <filename>.
That works from the command line, without running a Python script.

Related

h5py: how to use keys() to loop over HDF5 Groups and Datasets

print(list(file.keys()))
When I run this code I get:
T00000000,T00000001,T00000002,T00000003, ... ,T00000474
Now, I analyzed T00000000, but I want to scan them all with a for loop. I couldn't do it because these are strings. Is there any way to do this?
@python_student, there is more to this than explained in the initial answer. Based on the syntax of your question, it appears you are using h5py to read the HDF5 file. To effectively access the file contents, you need a basic understanding of HDF5 and h5py. I suggest starting here: the h5py Quick Start Guide. In addition, there are many good questions and answers here on StackOverflow with details and examples.
An HDF5 file has 2 basic objects:
Datasets: array-like collections of data
Groups: folder-like containers that hold datasets and other groups
h5py uses dictionary syntax to access Group objects and reads Datasets using NumPy syntax. (Note: Group objects are not Python dictionaries; they just "look" like them!)
As you noted, the keys() are the NAMES of the objects (groups or datasets) at the root level of your file. Your code created a list from the group keys: list(file.keys()). In general there is no reason to do this. Typically, you will iterate over the keys() or items() instead of creating a list.
Here is a short code segment to show how you might do this. I can add more details once I know more about your data schema. (HDF5 is a general data container and can have almost any schema.)
import h5py
import numpy as np

# 'file' is the open h5py.File object from the question

# loop on names:
for name in file.keys():
    print(name)

# loop on names and H5 objects:
for name, h5obj in file.items():
    if isinstance(h5obj, h5py.Group):
        print(name, 'is a Group')
    elif isinstance(h5obj, h5py.Dataset):
        print(name, 'is a Dataset')
        # return a np.array using dataset object:
        arr1 = h5obj[:]
        # return a np.array using dataset name:
        arr2 = file[name][:]
        # compare arr1 to arr2 (should always return True):
        print(np.array_equal(arr1, arr2))
Yes, you can use the split() method.
If the string is "T00000000,T00000001,T00000002,T00000003, ... ,T00000474", you can use split() to turn it into a list like this:
string = "T00000000,T00000001,T00000002,T00000003, ... ,T00000474"
values = string.split(",")
So the list values becomes ["T00000000", "T00000001", "T00000002", ... , "T00000474"].
Then you can use this list in a for loop.
If you don't want to create a list, you can simply do:
for value in string.split(","):
    # Your code here...
The for loop will be executed with the values T00000000, T00000001, T00000002, and so on.

Dictionary update replaces original values instead

Whenever I receive a new url, I try to add that in my dictionary, along with the current time.
However, when I use the update() method, it replaces the original values with the new values I added, so that the only things in the dictionary now are the new values (and not the old ones).
Here is a shorter version of my code:
if domain not in lst:
    lst.append(domain)
    domaindict = {}
    listofdomains.append(domaindict)
    domaindict.update({domain: datetime.now().strftime('%m/%d/%Y %H:%M:%S')})
if domain in lst:
    domindex = lst.index(domain)
    listofdomains[domindex].update({domain: datetime.now().strftime('%m/%d/%Y %H:%M:%S')})
lst is the list of domain names so far, while listofdomains is the list that contains all the dictionaries of the separate domains (each dictionary has the domain name plus the time).
When I try to print listofdomains:
print(listofdomains)
It only prints out the newly added domain and urls in the dictionaries. I also tried to use other methods to update a dictionary, as detailed in the answers to this question, but my dictionaries are still not functioning properly.
Why did the original key/value pairs disappear?
The simplest structure would probably be a dict of lists:
data = {domain1:[time1, time2, ...], domain2:[...] ...}
You can build it simply using a defaultdict that creates empty lists on the fly when needed. Your code would be:
from collections import defaultdict
data = defaultdict(list)
and your whole code becomes simply:
data[domain].append(datetime.now().strftime('%m/%d/%Y %H:%M:%S'))
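A minimal sketch of the whole flow with that structure (the domain names are invented for illustration):
from collections import defaultdict
from datetime import datetime

data = defaultdict(list)
for domain in ['example.com', 'example.org', 'example.com']:
    data[domain].append(datetime.now().strftime('%m/%d/%Y %H:%M:%S'))

print(dict(data))
# {'example.com': [<time1>, <time3>], 'example.org': [<time2>]}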

How do I use dictionary to store the filename generated by a method in Python?

I am using a method to generate multiple xml files.
I want to keep track of the files generated by the method using a dictionary:
map = {}
dstFile = f'path-to-dir\\{self.name}.xml'
with open(dstFile, 'w') as f_out:
    f_out.write(u'<?xml version="1.0" encoding="UTF-8"?>' + '\n')
    f_out.write(ET.tostring(self.root).decode('UTF-8'))
map = {f'{self.name}': f'{self.name}.xml'}
But using the map dictionary this way, the previous values in the dictionary get overwritten.
I want that when the method generates a file, its name gets added to the dictionary, keeping the older key-value pairs as well.
Thanks.
This line
map = {f'{self.name}':f'{self.name}.xml'}
creates a new dictionary and assigns it to the variable map. You want to add a new key-value pair to the existing dictionary.
You can do this the following way:
map[self.name] = f"{self.name}.xml"
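A minimal sketch of how this accumulates across calls (the class and dict names here are mine, not from the question; as an aside, map shadows Python's built-in map() function, so a more descriptive name is safer):
generated_files = {}  # created once, outside the method

class XmlWriter:
    def __init__(self, name):
        self.name = name

    def generate(self):
        # ... write the XML file as in the question ...
        generated_files[self.name] = f'{self.name}.xml'

XmlWriter('a').generate()
XmlWriter('b').generate()
print(generated_files)  # {'a': 'a.xml', 'b': 'b.xml'}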
You can also use the dict method update(), but note that the argument must itself be a mapping; an f-string cannot be used as a keyword argument name:
map.update({f'{self.name}': f'{self.name}.xml'})

Dynamically generate array elements from yaml file in Python

Given the following yaml file stored in my_yaml that contains varying sets of dictionary keys and/or class variables (denoted by self._*):
config1.json:
- [[foo, bar], [hello, world]]
config2.json:
- [[foo], [self._hi]]
From the json file, I want to populate a new list of tuples. The items in each tuple are determined by looking up dict keys in this yaml file.
So if I iterate through a dictionary called config1.json, and I have an empty list called config_list, I want to do something like:
config_list.append((i['foo']['bar'], i['hello']['world']))
But if it were config2.json, I want to do something like:
config_list.append((i['foo'], self._hi))
I can do this in a less dynamic way:
for i in my_yaml['config1.json'][0]:
    config_list.append(tuple([i[my_yaml[asset][0][0]][my_yaml[asset][0][1]], i[my_yaml[asset][1][0]][my_yaml[asset][1][1]]]))
or:
for i in my_yaml['config2.json'][0]:
    config_list.append(tuple([i[my_yaml[asset][0][0]], i[my_yaml[asset][1][0]]]))
Instead I would like to dynamically generate the contents of config_list
Any ideas or alternatives would be greatly appreciated.
I think you are confusing things a bit. First of all, you refer to a file in "From the json [sic] file", and there is no JSON file mentioned anywhere in the question. There are mapping keys that look like filenames for JSON files, so I hope we can assume you mean "from the value associated with the mapping key that ends in the string .json".
The other confusing thing is that you obfuscate the fact that you want tuples, but load lists nested in lists nested in lists from your YAML document.
If you want tuples, it is much clearer to specify them in your YAML document:
config1.json:
- !!python/tuple [[foo, bar], [hello, world]]
config2.json:
- !!python/tuple [[foo], [self._hi]]
So you can do:
import ruamel.yaml

yaml = ruamel.yaml.YAML(typ='unsafe')
with open('my.yaml') as fp:
    my_yaml = yaml.load(fp)

for key in my_yaml:
    for idx, elem in enumerate(my_yaml[key]):
        print('{}[{}] -> {}'.format(key, idx, elem))
which directly gives you the tuples you seem to want instead of lists you need to process:
config1.json[0] -> (['foo', 'bar'], ['hello', 'world'])
config2.json[0] -> (['foo'], ['self._hi'])
In your question you hard-code access to the first and only element of the sequences that are the values of the root-level mapping. This forces you to use the final [0] in your for loop. I assume you are going to have multiple elements in those sequences, but for a good question you should leave that out, as it is irrelevant to the question of how to get the tuples, and it only obfuscates things.
Please note that you need to keep control over your input, as using typ='unsafe' is, you guessed it, unsafe. If you cannot guarantee that, use typ='safe' and register and use the tag !tuple.
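A minimal sketch of that safer route (the constructor function name is mine; it registers the !tuple tag suggested above with the safe loader):
import ruamel.yaml

def construct_tuple(constructor, node):
    # build a Python tuple from a YAML sequence node
    return tuple(constructor.construct_sequence(node, deep=True))

yaml = ruamel.yaml.YAML(typ='safe')
yaml.constructor.add_constructor('!tuple', construct_tuple)

doc = """\
config1.json:
- !tuple [[foo, bar], [hello, world]]
"""
data = yaml.load(doc)
print(data['config1.json'][0])  # (['foo', 'bar'], ['hello', 'world'])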

How to insert/edit a column in an existing HDF5 dataset

I have an HDF5 file as seen below. I would like to edit the index column and create a new timestamp index. Is there any way to do this?
This isn't possible, unless you have the scheme / specification used to create the HDF5 files in the first place.
Many things can go wrong if you attempt to use HDF5 files like a spreadsheet (even via h5py). For example:
Inconsistent chunk shape, compression, data types.
Homogeneous data becoming non-homogeneous.
What you could do is add a list as an attribute to the dataset. In fact, this is probably the right thing to do. Sample code below, with the input as a dictionary. When you read in the data, you link the attributes to the homogeneous data (by row, column, or some other identifier).
import os
import h5py

def add_attributes(hdf_file, attributes, path='/'):
    """Add or change attributes in the path provided.
    Default path is the root group.
    """
    assert os.path.isfile(hdf_file), "File Not Found Exception '{0}'.".format(hdf_file)
    assert isinstance(attributes, dict), "attributes argument must be a key: value dictionary: {0}".format(type(attributes))
    with h5py.File(hdf_file, 'r+') as hdf:
        for k, v in attributes.items():
            hdf[path].attrs[k] = v
    return "The following attributes have been added or updated: {0}".format(list(attributes.keys()))
