How to insert/edit a column in an existing HDF5 dataset - python

I have an HDF5 file, as seen below. I would like to edit the index column and create a new timestamp index. Is there any way to do this?

This isn't possible, unless you have the scheme / specification used to create the HDF5 files in the first place.
Many things can go wrong if you attempt to use HDF5 files like a spreadsheet (even via h5py). For example:
Inconsistent chunk shape, compression, data types.
Homogeneous data becoming non-homogeneous.
What you could do instead is add a list as an attribute on the dataset; in fact, this is probably the right thing to do. Sample code is below, taking the attributes as a dictionary. When you read the data back, you link the attributes to the homogeneous data (by row, column, or some other identifier).
import os
import h5py

def add_attributes(hdf_file, attributes, path='/'):
    """Add or change attributes at the path provided.
    Default path is the root group.
    """
    assert os.path.isfile(hdf_file), "File Not Found Exception '{0}'.".format(hdf_file)
    assert isinstance(attributes, dict), "attributes argument must be a key: value dictionary: {0}".format(type(attributes))
    with h5py.File(hdf_file, 'r+') as hdf:
        for k, v in attributes.items():
            hdf[path].attrs[k] = v
    return "The following attributes have been added or updated: {0}".format(list(attributes.keys()))

Related

Saving a DataFrame with some extra information

I am trying to store some extra information directly in a DataFrame, such as parameters describing the stored data.
I added this information just as extra attributes to the DataFrame:
df.data_origin = 'my_origin'
print(df.data_origin)
But when it is saved and loaded, those extra attributes are lost:
df.to_pickle('pickle_test.pkl')
df2 = pd.read_pickle('pickle_test.pkl')
print(len(df2))
print(df2.data_origin)
...
465387
AttributeError: 'DataFrame' object has no attribute 'data_origin'
The workaround I have found is to save the __dict__ of the DataFrame and then assign it to the __dict__ of an empty DataFrame:
import pickle
import pandas as pd

with open('modified_dataframe.pkl', "wb") as pkl_out:
    pickle.dump(df.__dict__, pkl_out)

df2 = pd.DataFrame()
with open('modified_dataframe.pkl', "rb") as pkl_in:
    df2.__dict__ = pickle.load(pkl_in)
print(len(df2))
print(df2.data_origin)
...
465387
my_origin
It seems to work, but:
Is there a better way to do it?
Am I losing information? (apparently, all the data is there)
A different solution is discussed here, but I would like to know whether saving the __dict__ of a class is a valid way to hold all of its information.
EDIT: OK, I found the big drawback. This works fine for saving single DataFrames to isolated files, but it will not work if I have dictionaries, lists, or similar containers with DataFrames in them.
I suggest getting this done by subclassing pandas.DataFrame: make a new class that inherits from pandas.DataFrame and add the attributes you want there. This may seem a bit heavy-handed, but it behaves predictably when you use the object in different places. Other approaches might still be useful for specific cases, though.
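A minimal sketch of that idea, using pandas' documented subclassing hooks (_metadata and _constructor); the class and attribute names are just examples, and whether such attributes survive to_pickle/read_pickle depends on the pandas version:
import pandas as pd

class AnnotatedDataFrame(pd.DataFrame):
    # names listed in _metadata are carried along by many pandas operations
    _metadata = ["data_origin"]

    @property
    def _constructor(self):
        # keeps derived frames (slices, copies, ...) as AnnotatedDataFrame
        return AnnotatedDataFrame

df = AnnotatedDataFrame({"a": [1, 2, 3]})
df.data_origin = "my_origin"
print(df.data_origin)  # 'my_origin'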

How to list all datasets in h5py file?

I have an h5py file storing numpy arrays, but I got an "Object doesn't exist" error when trying to open it with the dataset name I remember, so is there a way to list the datasets the file has?
with h5py.File('result.h5', 'r') as hf:
    # How can I list all the datasets I have saved in hf?
You have to use the keys() method. This will give you the names (unicode strings) of the datasets and groups directly under the root group.
For example:
dataset_names = list(hf.keys())
Another GUI-based method would be to use HDFView.
https://support.hdfgroup.org/products/java/release/download.html
The other answers just tell you how to make a list of the keys under the root group, which may refer to other groups or datasets.
If you want something closer to h5dump but in Python, you can do something like this:
import h5py

def descend_obj(obj, sep='\t'):
    """
    Iterate through the groups in an HDF5 file, printing group and dataset names
    and dataset attributes.
    """
    if isinstance(obj, (h5py.Group, h5py.File)):
        for key in obj.keys():
            print(sep, '-', key, ':', obj[key])
            descend_obj(obj[key], sep=sep + '\t')
    elif isinstance(obj, h5py.Dataset):
        for key in obj.attrs.keys():
            print(sep + '\t', '-', key, ':', obj.attrs[key])

def h5dump(path, group='/'):
    """
    Print HDF5 file metadata.
    group: you can give a specific group, defaults to the root group
    """
    with h5py.File(path, 'r') as f:
        descend_obj(f[group])
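Called on the file from the question, this prints the full group/dataset tree, including dataset attributes:
h5dump('result.h5')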
If you want to list the key names, use the keys() method, which gives you a view of the keys; pass it to list() to get an actual list:
with h5py.File('result.h5', 'r') as hf:
    dataset_names = list(hf.keys())
If you are at the command line, use h5ls -r [file] or h5dump -n [file] as recommended by others.
Within python, if you want to list below the topmost group but you don't want to write your own code to descend the tree, try the visit() function:
with h5py.File('result.h5', 'r') as hf:
    hf.visit(print)
Or for something more advanced (e.g. to include attributes info) use visititems:
def printall(name, obj):
    print(name, dict(obj.attrs))

with h5py.File('result.h5', 'r') as hf:
    hf.visititems(printall)
Since using the keys() function will give you only the top level keys and will also contain group names as well as datasets (as already pointed out by Seb), you should use the visit() function (as suggested by jasondet) and keep only keys that point to datasets.
This answer is kind of a merge of jasondet's and Seb's answers to a simple function that does the trick:
def get_dataset_keys(f):
    keys = []
    f.visit(lambda key: keys.append(key) if isinstance(f[key], h5py.Dataset) else None)
    return keys
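Usage, with the file name from the question:
with h5py.File('result.h5', 'r') as hf:
    dataset_keys = get_dataset_keys(hf)
print(dataset_keys)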
Just to show the names of the underlying datasets, I would simply use h5dump -n <filename>. That works without running a Python script.

Retrieving hdf5 index given literal name

I have an hdf5 database with 3 keys (features, image_ids, index). The image_ids and index each have 1000 entries.
The problem is, while I can get the 10th image_id via:
dbhdf5 ["image_ids"][10]
>>> u'image001.jpg'
I want to do the reverse, i.e. find the index by passing the image name. Something like:
dbhdf5 ["image_ids"="image001.jpg"]
or
dbhdf5 ["image_ids"]["image001.jpg"]
or
dbhdf5 ['index']['image001.jpg']
I've tried every variation I can think of, but can't seem to find a way to retrieve the index of an image, given its id. I get errors like 'Field name only allowed for compound types'.
What you are trying is not possible directly. HDF5 works by storing arrays, which are accessed via numerical indices.
Supposing that you also manage the creation of the file, you can store your data in separate named arrays:
/index
|-- image001.jpg
|-- image002.jpg
|-- ...
/features
|-- image001.jpg
|-- image002.jpg
|-- ...
So you can access them via names:
dbhdf5['features']['image001.jpg']
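For completeness, a sketch of how such a layout could be created with h5py (the file name, number of images, and feature size are assumptions for illustration):
import numpy as np
import h5py

with h5py.File('images.h5', 'w') as f:
    features = f.create_group('features')
    index = f.create_group('index')
    for i, image_id in enumerate(['image001.jpg', 'image002.jpg']):
        features.create_dataset(image_id, data=np.random.rand(128))  # per-image feature vector
        index.create_dataset(image_id, data=i)                       # per-image numeric index

with h5py.File('images.h5', 'r') as dbhdf5:
    print(dbhdf5['features']['image001.jpg'][:])
    print(dbhdf5['index']['image001.jpg'][()])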
If the files are generated by someone else, you have to store the keys yourself, for instance with a dict:
lookup = {}
for i, key in enumerate(dbhdf5['image_ids'][:]):
    lookup[key] = i
and access them via this indirection
dbhdf5['index'][lookup['image001.jpg']]

Accessing Data from .mat (version 8.1) structure in Python

I have a Matlab (.mat, version >7.3) file that contains a structure (data) that itself contains many fields. Each field is a single column array. Each field represents an individual sensor and the array is the time series data. I am trying to open this file in Python to do some more analysis. I am using PyTables to read the data in:
import tables
impdat = tables.openFile('data_file.mat')
This reads the file in, and I can inspect the file object and get the names of each field using:
impdat.root.data.__members__
This prints a list of the fields:
['rdg', 'freqlabels', 'freqbinsctr',... ]
Now, what I would like is a method to take each field in data and make a python variable (perhaps dictionary) with the field name as the key (if it is a dictionary) and the corresponding array as its value. I can see the size of the array by doing, for example:
impdat.root.data.rdg
which returns this:
/data/rdg (EArray(1, 1286920), zlib(3))
atom := Int32Atom(shape=(), dflt=0)
maindim := 0
flavor := 'numpy'
byteorder := 'little'
chunkshape := (1, 16290)
My question is how do I access some of the data stored in that large array (1, 1286920). How can I read that array into another Python variable (list, dictionary, numpy array, etc.)? Any thoughts or guidance would be appreciated.
I have come up with a working solution. It is not very elegant, as it requires an eval. I first create a new variable (alldata) pointing to the data I want to access, then an empty dictionary datastruct, and finally loop over all the members of data and assign the arrays to the appropriate key in the dictionary:
alldata = impdat.root.data
datastruct = {}
for names in impdat.root.data.__members__:
    datastruct[names] = eval('alldata.' + names + '[0][:]')
The '[0]' may be superfluous depending on the structure of the data you are trying to access. In my case the data is stored as an array of arrays and I just want the first one. If you come up with a better solution, please feel free to share.
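A variant of the same loop that avoids the eval by using getattr (a sketch, assuming the same PyTables layout as in the question):
alldata = impdat.root.data
datastruct = {}
for name in impdat.root.data.__members__:
    node = getattr(alldata, name)  # the PyTables array for this field
    datastruct[name] = node[0][:]  # same '[0]' caveat as above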
I can't seem to replicate your code. I get an error when trying to open the file which I made in 8.0 using tables.
How about taking the variables within the structure and saving them to a new .mat file that contains only a collection of variables? That would make it much easier to deal with, and it has already been answered quite eloquently here.
That answer states that such .mat files are simply HDF5 files, which can be read with:
import numpy as np, h5py
f = h5py.File('somefile.mat','r')
data = f.get('data/variable1')
data = np.array(data) # For converting to numpy array
I'm not sure of the size of the data set you're working with. If it's large, I'm sure a script could be put together to pull the fields out of the structures. I did find this tool, which may be helpful; it recursively gets all of the structure field names.
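Building on that, a sketch that loops over all fields of the data structure and collects them into a dictionary of numpy arrays (the file name and the 'data' group mirror the question; whether an extra [0] index is needed depends on how MATLAB stored each field):
import numpy as np
import h5py

datastruct = {}
with h5py.File('data_file.mat', 'r') as f:
    for name in f['data'].keys():
        datastruct[name] = np.array(f['data'][name])  # each field as a numpy array
print(list(datastruct.keys()))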

Representing filesystem table

I'm working on a simple class, something like an "in-memory Linux-like filesystem", for educational purposes. Files will be StringIO objects. I can't decide how to implement the file/folder hierarchy in Python. I'm thinking about using a list of objects with fields such as type, name, and parent; what else? Maybe I should look at trees and graphs.
Update:
There will be these methods:
new_dir(path)
dir_list(path)
is_file(path)
is_dir(path)
remove(path)
read(file_descr)
open(file_path, mode=w|r) -> file_descr
close(file_descr)
write(file_descr, str)
It's perfectly possible to represent a tree as a nested set of lists. However, since entries are typically indexed by name, and a directory is generally considered to be unordered, nested dictionaries would make many operations faster and easier to write.
I wouldn't store the parent for each entry though, that's implicit from its position in the hierarchy.
Also, if you want your virtual file system to efficiently support hard links, you need to separate a file's contents from the directory hierarchy. That way, you can re-use the contents by giving each piece of content any number of names, which is what hard linking does.
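A minimal sketch of that layout, with nested dicts for directories and separate content objects, so two names can share the same contents the way a hard link does (the names here are illustrative):
from io import StringIO

# Contents live outside the hierarchy; directory entries just point at them.
contents = StringIO("hello world")

root = {
    "home": {
        "user": {
            "notes.txt": contents,       # original name
            "notes-link.txt": contents,  # hard link: same content object
        }
    }
}

# Both paths resolve to the same underlying StringIO object.
assert root["home"]["user"]["notes.txt"] is root["home"]["user"]["notes-link.txt"]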
Maybe you can try using networkx. You just have to be a bit inventive to adapt it to work with files and folders.
A simple example:
import os
import networkx as nx

G = nx.Graph()
for (path, dirs, files) in os.walk(os.getcwd()):
    bname = os.path.basename(path)  # use the directory's base name as the node
    for f in files:
        G.add_edge(bname, f)
# Now do whatever you want with the graph
You should first ask the question: What operations should my "file system" support?
Based on the answer you select the data representation.
For example, if you choose to support only create and delete, and the order of the files in a directory is not important, then a Python dictionary is a good fit. A dictionary maps a file name (sub-path name) to either another dictionary or the file container object.
What's the API of the filestore? Do you want to keep creation, modification and access times? Presumably the primary lookup will be by file name. Are any other retrieval operations anticipated?
If only lookup by name is required then one possible representation is to map the filestore root directory on to a Python dict. Each entry's key will be the filename, and the value will either be a StringIO object (hint: in Python 2 use cStringIO for better performance if it becomes an issue) or another dict. The StringIO objects represent your files, the dicts represent subdirectories.
So, to access any path you split it up into its constituent components (using .split("/")) and then use each to look up a successive element. Any KeyError exceptions imply "File or directory not found," as would any attempts to index a StringIO object (I'm too lazy to verify the specific exception).
If you want to implement greater detail then you would replace the StringIO objects and dicts with instances of some "filestore object" class. You could call it a "link" (since that's what it models: A Linux hard link). The various attributes of this object can easily be manipulated to keep the file attributes up to date, and the .data attribute can be either a StringIO object or a dict as before.
Overall I would prefer the second solution, since then it's easy to implement methods that do things like keep access times up to date by updating them as the operations are performed, but as I said much depends on the level of detail you want to provide.
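As a rough sketch of the dict-plus-StringIO representation described above, with path lookup by splitting on "/" (the helper name and the choice of FileNotFoundError are mine, not prescribed by the answer):
from io import StringIO

root = {"etc": {"motd": StringIO("welcome\n")}, "tmp": {}}

def lookup(root, path):
    """Walk the nested-dict filestore; raise FileNotFoundError on a bad path."""
    node = root
    for part in path.strip("/").split("/"):
        if part == "":
            continue  # e.g. path was just "/"
        try:
            node = node[part]
        except (KeyError, TypeError):  # TypeError: tried to descend into a file
            raise FileNotFoundError(path)
    return node

print(lookup(root, "/etc/motd").getvalue())    # 'welcome\n'
print(isinstance(lookup(root, "/tmp"), dict))  # True: it's a directory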
