I have an HDF5 database with 3 keys (features, image_ids, index). The image_ids and index datasets each have 1000 entries.
The problem is, while I can get the 10th image id via:
dbhdf5 ["image_ids"][10]
>>> u'image001.jpg'
I want to do the reverse, i.e. find the index by passing the image name. Something like:
dbhdf5 ["image_ids"="image001.jpg"]
or
dbhdf5 ["image_ids"]["image001.jpg"]
or
dbhdf5 ['index']['image001.jpg']
I've tried every variation I can think of, but can't seem to find a way to retrieve the index of an image given its id. I get errors like 'Field name only allowed for compound types'.
What you are trying is not possible: HDF5 stores arrays that are accessed via numerical indices.
Supposing that you also manage the creation of the file, you can store your data in separate named arrays:
/index
|-- image001.jpg
|-- image002.jpg
...
/features
|-- image001.jpg
|-- image002.jpg
...
So you can access them via names:
dbhdf5['features']['image001.jpg']
If the files are generated by someone else, you have to build the lookup yourself, for instance with a dict:
lookup = {}
for i, key in enumerate(dbhdf5['image_ids'][:]):
    lookup[key] = i
and access the values via this indirection:
dbhdf5['index'][lookup['image001.jpg']]
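One wrinkle worth noting: under Python 3, h5py returns string data as bytes, so the lookup keys may need decoding before you can index them with a plain str. A minimal self-contained sketch, using a made-up file name and ids:

```python
import h5py
import numpy as np

# Build a small demo file with hypothetical image ids.
with h5py.File('demo_ids.h5', 'w') as f:
    f.create_dataset('image_ids',
                     data=np.array([b'image001.jpg', b'image002.jpg']))

# Build the reverse lookup, decoding bytes to str where needed.
with h5py.File('demo_ids.h5', 'r') as f:
    lookup = {
        (key.decode() if isinstance(key, bytes) else key): i
        for i, key in enumerate(f['image_ids'][:])
    }

print(lookup['image001.jpg'])  # -> 0
```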
I'm reading in a DICOM with pydicom.read_file() like this:
x = pydicom.read_file('/path/to/dicom/')
This returns an instance of FileDataset but I get an error when trying to access a value like this:
x[0x20,0x32]
OUTPUT: *** KeyError: (0020, 0032)
I also tried accessing the value like this:
x.ImagePositionPatient
OUTPUT: *** AttributeError: 'FileDataset' object has no attribute 'ImagePositionPatient'
The reason this confuses me is that when I look at all the values using x.values I see the field is in fact present:
(0020, 0032) Image Position (Patient)
How can the key be missing if I can clearly see that it exists?
I'm not an expert on pydicom, but it should work just like a regular Python dictionary.
For Enhanced SOP classes (Enhanced CT, Enhanced MR, and some more), many tags are located in sequences: in the Shared Functional Groups Sequence for tags that are common to all slices, and in the Per-Frame Functional Groups Sequence, with one item per slice, for tags specific to each slice.
(Note that the specifics here are for Enhanced MR, to match your dataset.)
Image Position (Patient) is a slice-specific tag, so you can access the value for each slice in the corresponding item. The functional group sequences do not directly contain the tags; they are nested in another sequence, which in the case of Image Position (Patient) is the Plane Position Sequence. To access these values in pydicom you can do something like:
from pydicom import dcmread

ds = dcmread(filename)
positions = []
for item in ds.PerFrameFunctionalGroupsSequence:
    positions.append(item.PlanePositionSequence[0].ImagePositionPatient)
print(positions)
This will give you the image position for each slice.
You should check that you have the correct SOP class, of course, and add checks for existence of the tags. Depending on your use case, you may only need the first of these positions, in which case you wouldn't have to iterate over the items.
I have an HDF5 file. I would like to edit the index column and create a new timestamp index. Is there any way to do this?
This isn't possible, unless you have the scheme / specification used to create the HDF5 files in the first place.
Many things can go wrong if you attempt to use HDF5 files like a spreadsheet (even via h5py). For example:
Inconsistent chunk shape, compression, data types.
Homogeneous data becoming non-homogeneous.
What you could do is add a list as an attribute to the dataset. In fact, this is probably the right thing to do. Sample code below, with the input as a dictionary. When you read in the data, you link the attributes to the homogeneous data (by row, column, or some other identifier).
import os
import h5py

def add_attributes(hdf_file, attributes, path='/'):
    """Add or change attributes in the path provided.

    The default path is the root group.
    """
    assert os.path.isfile(hdf_file), "File Not Found Exception '{0}'.".format(hdf_file)
    assert isinstance(attributes, dict), "attributes argument must be a key: value dictionary: {0}".format(type(attributes))
    with h5py.File(hdf_file, 'r+') as hdf:
        for k, v in attributes.items():
            hdf[path].attrs[k] = v
    return "The following attributes have been added or updated: {0}".format(list(attributes.keys()))
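A quick usage sketch, assuming h5py is available; the file name and attribute values are made up, and the helper is repeated in condensed form so the snippet runs on its own:

```python
import os
import h5py

def add_attributes(hdf_file, attributes, path='/'):
    # Condensed copy of the helper above, so this sketch is self-contained.
    assert os.path.isfile(hdf_file)
    assert isinstance(attributes, dict)
    with h5py.File(hdf_file, 'r+') as hdf:
        for k, v in attributes.items():
            hdf[path].attrs[k] = v
    return list(attributes.keys())

# Create a small file, then tag it with a (made-up) timestamp attribute.
with h5py.File('data_demo.h5', 'w') as f:
    f.create_dataset('values', data=[1, 2, 3])

add_attributes('data_demo.h5', {'timestamp': '2020-01-01T00:00:00'})

with h5py.File('data_demo.h5', 'r') as f:
    stamp = f.attrs['timestamp']
print(stamp)  # -> 2020-01-01T00:00:00
```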
I have a dictionary that is being built while iterating through objects. The same object can be accessed multiple times, and I'm using the object itself as a key.
So if the same object is accessed more than once, the key is no longer unique and my dictionary is no longer correct.
Still, I need to access it by object, because later on, if someone wants to access the contents, they can request them by the current object. And that will be correct, because it will access the last active object at that time.
So I'm wondering if it is possible to wrap the object somehow, so it keeps its state and all attributes, with the only difference being that this new kind of object is actually unique.
For example:
dct = {}
for obj in some_objects_lst:
    # Well, this kind of wraps it, but it loses state: if I
    # instantiated it, I would lose all the information in that obj.
    wrapped = type('Wrapped', (type(obj),), {})
    dct[wrapped] = ...  # add some content
If there are better alternatives to this, I would like to hear them too.
P.S. objects being iterated would be in different context, so even if object is the same, it would be treated differently.
Update
As requested, here is a better example of where the problem comes from:
I have this Excel reports generator module. Using it, you can generate various Excel reports. For that you need to write a configuration using a Python dictionary.
Now, before a report is generated, the module must do two things: first, gather metadata (here, the position of each cell that will exist when the report is created), and second, parse the configuration to fill the cells with content.
One of the value types that can be used in this module is a formula (an Excel formula). And the problem in my question is specifically with one of the ways a formula can be computed: formula values that are retrieved for a parent from its children.
For example imagine this excel file structure:
  |      A       |    B    |   C
  | Total Childs |  Name   | Amount
1 | sum(childs)  |         |
2 |              | child_1 | 10
3 |              | child_2 | 20
4 | sum(childs)  |         |
...
Now, in this example, the sum in cell A1 would need to be 10+20=30 if the sum expression summed its children's column (in this case, column C). And all of this works until the same object (I call them iterables) is repeated. When building the metadata I need to store it for later retrieval, and the key is the iterated object itself. So when the object is iterated again while parsing values, it will not see all the information, because some of it will have been overwritten under the same key.
For example imagine there are invoice objects, then there are partner objects which are related with invoices and there are some other arbitrary objects that given invoice and partner produce specific amounts.
So when extracting such information in excel, it goes like this:
invoice1 -> partner1 -> amount_obj1, amount_obj2
invoice2 -> partner1 -> amount_obj3, amount_obj4
Notice that the partner in this example is the same. That is the problem: I can't store it as a key, because when parsing values I will iterate over this object twice, while the metadata will only hold the values for amount_obj3 and amount_obj4.
P.S. I don't know if I explained it better, because there is lots of code and I don't want to put huge walls of code here.
Update2
I'll try to explain this problem from a more abstract angle, because being too specific seems to confuse everyone even more.
So, given a list of objects and an empty dictionary, the dictionary is built by iterating over the objects. The objects act as keys in the dictionary, whose values contain metadata used later on.
Now the same list can be iterated again for a different purpose. When that is done, the code needs to access the dictionary values using the iterated object (the same objects that are keys in the dictionary). The problem is that if the same object was used more than once, only the latest stored value for that key remains.
That means the object is not a unique key here, yet the object is the only thing I know when I need to retrieve the value. Because it is the same iteration, the iteration index will be the same when accessing the same object both times.
So the unique key, I guess, is (index, object).
I'm not sure I understand your problem, so here are two options. If it's the object content that matters, keep object copies as keys. Something crude like:
import copy

new_obj = copy.deepcopy(obj)
dct[new_obj] = whatever_you_need_to_store(new_obj)
If the object doesn't change between the first time it's checked by your code and the next, the operation is just performed a second time with no effect. Not optimal, but probably not a big problem. If it does change, though, you get separate records for the old and new versions. To save memory you will probably want to replace copies with hashes, a __str__() method that serializes the object data, or similar. But that depends on what your object is; maybe hashing will take too much time for minuscule savings in memory. Run some tests and see what works.
If, on the other hand, it's important to keep the same value for the same object, whether the data within it have changed or not (say, object is a user session that can change its data between login and logoff), use object ids. Not the builtin id() function, because if the object gets GCed or deleted, some other object may get its id. Define an id attribute for your objects and make sure different objects cannot possibly get the same one.
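A minimal sketch of that second option, assuming you control the classes of the objects being tracked (the counter-based uid scheme is just one possible way to guarantee uniqueness):

```python
import itertools

class Tracked:
    # A process-wide counter hands out ids that are never reused,
    # unlike the builtin id(), whose values can be recycled after GC.
    _counter = itertools.count()

    def __init__(self, payload):
        self.payload = payload
        self.uid = next(Tracked._counter)

dct = {}
a = Tracked('partner1')
b = Tracked('partner1')   # same content, but a distinct key

dct[a.uid] = 'metadata for first occurrence'
dct[b.uid] = 'metadata for second occurrence'

print(len(dct))  # -> 2
```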
I have an HDF4 file I need to read with Python. For this I use pyhdf. In most cases I am quite happy to use the SD class to open the file:
import pyhdf.SD as SD
hdf = SD.SD(hdfFile)
and then continue with
v1 = hdf.select('Data set 1')
v2 = hdf.select('Data set 2')
However, I have several groups in the HDF file, and some variables appear in more than one group under the same name:
Group 1 has a Data set 3 and Group 2 also has a Data set 3, so I guess my select command will only select one of them (without me knowing which one?).
Is there a simple way of selecting (reading) Data set 3 from Group 1 and then from Group 2?
I have looked at the V and VS modules. I found an example script that loops through all groups and subgroups and finds all variables (data sets). But I have no idea how to connect those variables to their parent group, so that I know which group they belong to.
I think that pyhdf might not be the best choice for this particular task. Have you looked at PyNIO?
From the HDF section of their documentation:
PyNIO has a read-only ability to understand HDF Vgroups. When a variable that is part of a Vgroup is encountered, PyNIO appends a double underscore and the group number to the end of the variable name. This ensures that the variable will have a unique name, relative to variables that belong to other Vgroups. It also provides two additional attributes to the variable: hdf_group, whose value is the HDF string name of the group, and hdf_group_id, whose value is the same as the group number appended to the end of the variable name.
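Going only by the naming convention quoted above, a name like 'Data_set_3__1' can be split back into its base name and group number with a small helper (this parser is my own sketch, not part of PyNIO):

```python
def split_vgroup_name(var_name):
    """Split a PyNIO-style 'name__N' into (base_name, group_number).

    Returns (var_name, None) if there is no '__N' suffix.
    """
    base, sep, tail = var_name.rpartition('__')
    if sep and tail.isdigit():
        return base, int(tail)
    return var_name, None

print(split_vgroup_name('Data_set_3__1'))   # -> ('Data_set_3', 1)
print(split_vgroup_name('plain_variable'))  # -> ('plain_variable', None)
```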
I have a model like the one below:
class XXX(db.Model):
    f_list = db.ListProperty(int, indexed=True)  # stores 50000 numbers
How to access the 3rd item in f_list?
You would use a standard list indexing operation to access the 3rd item in the list:
some_obj.f_list[2]
However, the entire entity will be loaded into memory when you fetch an instance of XXX.
There is no way around this with the model you have.
Even a projection query will return the entire list.
The only real possibility would be to start creating multiple sub-entities.
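To sketch the sub-entity idea: split the list into fixed-size chunks and store each chunk in its own entity, keyed by chunk index, so a fetch only loads one chunk. The chunking arithmetic below is plain Python; the chunk size is a hypothetical choice, and the datastore entity definitions are omitted:

```python
CHUNK_SIZE = 1000  # hypothetical number of list items per sub-entity

def chunk_index(i):
    """Return (which sub-entity, offset within it) for list index i."""
    return i // CHUNK_SIZE, i % CHUNK_SIZE

def split_into_chunks(values, size=CHUNK_SIZE):
    """Split a long list into the pieces each sub-entity would store."""
    return [values[j:j + size] for j in range(0, len(values), size)]

numbers = list(range(50000))
chunks = split_into_chunks(numbers)

entity, offset = chunk_index(2)   # locate the 3rd item
print(len(chunks))                # -> 50
print(chunks[entity][offset])     # -> 2
```

Fetching an item then means loading only the one sub-entity named by `chunk_index`, instead of the full 50000-element list.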