How to save/extract a dataset from HDF5 and convert it to TIFF? - python

I am trying to import CT scan data into ImageJ/Fiji. There is an HDF5 plugin for ImageJ/Fiji, but the synchrotron CT datasets are so large that it fails to open them. The scan data (the image dataset) is stored as a dataset inside the HDF5 file, so I have to extract the image dataset from the HDF5 file and then convert it to a TIFF file.
The HDF5 file path is "F:/New_ESRF/SNT_BTO4/SNT_BTO4_S1/SNT_BTO4_S1_1_1pag_db0005_vol.hdf5"
The file 'SNT_BTO4_S1_1_1pag_db0005_vol.hdf5' contains several datasets, and the image dataset is located at /entry0000/reconstruction/results/data
So far I have accessed the image dataset using h5py, but I am stuck on how to extract/save the dataset separately from the HDF5 file.
What code is required to extract the image dataset from the HDF5 file?
After that, I am thinking of using Image from PIL to convert the image to a TIFF file. Can I get any advice on the code for this?
import numpy as np
import h5py
filename = "F:/New_ESRF/SNT_BTO4/SNT_BTO4_S1/SNT_BTO4_S1_1_1pag_db0005_vol.hdf5"
with h5py.File(filename,'r') as hdf:
    base_items = list (hdf.items())
    print('#Items in the base directory:', base_items)
    G1 = hdf.get ('entry0000')
    G1_items = list (G1.items())
    print('#Items in entry0000', G1_items)
    G11 = G1.get ('/entry0000/reconstruction')
    G11_items = list (G11.items())
    print('#Items in reconstruction', G11_items)
    G12 = G11.get ('/entry0000/reconstruction/results')
    G12_items = list (G12.items())
    print('#Items in results', G12_items)

Extracting image data from an HDF5 file and converting it to an image is a relatively straightforward two-step process:
1. Access the data in the HDF5 file
2. Convert to an image with cv2 (or PIL)
A simple example is available here: How to extract individual JPEG images from a HDF5 file.
You can apply the same process to your file. Here is some pseudo-code. It's not complete because you don't show the shape of the image dataset (and the shape affects how to read the data). You also didn't say how many images are in dataset /entry0000/reconstruction/results/data: does it hold a single image or multiple images? If multiple images, which axis is the image counter?
import h5py
import cv2  ## for image conversion
filename = "F:/New_ESRF/SNT_BTO4/SNT_BTO4_S1/SNT_BTO4_S1_1_1pag_db0005_vol.hdf5"
with h5py.File(filename,'r') as hdf:
    # get image dataset
    img_ds = hdf['/entry0000/reconstruction/results/data']
    print(f'Image Dataset info: Shape={img_ds.shape},Dtype={img_ds.dtype}')
    ## following depends on dataset shape/schema
    ## code below assumes images are along axis=0
    for i in range(img_ds.shape[0]):
        cv2.imwrite(f'test_img_{i:03}.tiff',img_ds[i,:])  # uses slice notation
        # alternately load to a numpy array first
        img_arr = img_ds[i,:]  # slice notation reads all remaining axes of image i
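Since the question mentions PIL, here is a minimal sketch of the same loop using PIL instead of cv2. It is only a sketch under stated assumptions: it assumes the dataset is 3-D with the slice index on axis 0 and holds float32 values (the question shows neither), and it runs on a small synthetic file rather than the real volume. The helper name export_slices, the output file names and the demo data are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np
from PIL import Image

DATASET_PATH = "/entry0000/reconstruction/results/data"  # path taken from the question


def export_slices(h5_path, out_dir, dataset_path=DATASET_PATH):
    """Write each slice along axis 0 of a 3-D dataset to its own TIFF file."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with h5py.File(h5_path, "r") as hdf:
        ds = hdf[dataset_path]
        for i in range(ds.shape[0]):
            # Only slice i is read from disk here, not the whole volume.
            slice_2d = ds[i, :, :]
            out_path = os.path.join(out_dir, f"slice_{i:04d}.tiff")
            Image.fromarray(slice_2d).save(out_path)
            written.append(out_path)
    return written


# Small synthetic stand-in for the real volume (hypothetical data).
tmp_dir = tempfile.mkdtemp()
demo_file = os.path.join(tmp_dir, "demo_vol.hdf5")
volume = np.random.default_rng(0).random((4, 8, 6)).astype(np.float32)
with h5py.File(demo_file, "w") as hdf:
    hdf.create_dataset(DATASET_PATH, data=volume)

tiff_paths = export_slices(demo_file, os.path.join(tmp_dir, "tiffs"))
print(len(tiff_paths), "files written")
```

Because each iteration reads a single slice, memory use stays at roughly one image regardless of how large the full dataset is.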
Note: you don't need to use .get() to get a dataset. You can simply reference the dataset path. Also, when you start from a group object, use the path relative to that group, not the absolute path. (You should modify your code to reflect these changes.) For example, the following are equivalent:
G1 = hdf['entry0000']
## is the same as G1 = hdf.get('entry0000')
G11 = hdf['entry0000/reconstruction']
## is the same as G11 = hdf.get('entry0000/reconstruction')
## OR referencing G1 group object:
G11 = G1['reconstruction']
## is the same as G11 = G1.get('reconstruction')
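The equivalence can be checked on a throwaway file. The sketch below builds a tiny HDF5 file with the same group layout as in the question (the file name and the dummy data are hypothetical) and compares the .name of the objects returned by each access style.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a tiny file that mimics the group layout from the question.
tmp_path = os.path.join(tempfile.mkdtemp(), "paths_demo.hdf5")
with h5py.File(tmp_path, "w") as hdf:
    hdf.create_dataset("entry0000/reconstruction/results/data",
                       data=np.zeros((2, 3, 3), dtype=np.uint16))

with h5py.File(tmp_path, "r") as hdf:
    g1 = hdf["entry0000"]
    via_file = hdf["entry0000/reconstruction"]   # path from the file root
    via_group = g1["reconstruction"]             # path relative to g1
    via_get = g1.get("reconstruction")           # same, using .get()
    names = [via_file.name, via_group.name, via_get.name]
    # A relative path can also go straight down to the dataset.
    data_shape = g1["reconstruction/results/data"].shape

print(names)
print(data_shape)
```

All three objects report the same absolute name, so the choice between them is a matter of readability.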


Reshape(-1) images in a h5 dataset

I am taking a pattern recognition course this semester. My project is a face detection system built from 3000+ images, and I am using Python for it.
So far I have converted each image into a NumPy array and stored it in a list using the code below:
# convert to numpy array, then grayscale, then resize, then vectorize, finally store in
# a list
for file in sorted(img_path):
    img = cv2.imread(file)
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img_gray = cv2.resize(img_gray, dsize=(150, 150), interpolation=cv2.INTER_CUBIC)
    img_gray = img_gray.reshape(-1)
# save to .h5 file, not yet do for label dataset
hf = h5py.File(save_path, 'a')
dset = hf.create_dataset('dataset',data=imagesData)
A small question here: does reshape(-1) mean vectorize? When I print imagesData.shape I get (22500,); originally it was (150,150).
The images come from a Google Drive folder (consisting of .png images). I use sorted in the loop because I want to store the NumPy arrays in the list in order from the first to the last image (1223 - 5222). I do this because I was given a text file containing features arranged in the same order (1223 - 5222), and I am going to store both the dataset (imagesData) and the label dataset (features) inside a .h5 file. The features text file looks like this:
[image: text file]
Am I right? After storing both the dataset and the label dataset in the .h5 file, I will load them back and run some machine learning algorithms for my project, so I have to make sure each sample row matches the correct label.
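On the reshape(-1) part of the question, a small NumPy sketch may help. It uses a synthetic 150x150 array as a stand-in for one resized grayscale image (the pixel values are made up) and shows that reshape(-1) flattens it row by row into a vector of 150 * 150 = 22500 values, which can be reshaped back without loss.

```python
import numpy as np

# Stand-in for one resized grayscale image (hypothetical pixel values).
img_gray = (np.arange(150 * 150) % 256).astype(np.uint8).reshape(150, 150)

# -1 lets NumPy infer the length, here 150 * 150 = 22500.
flat = img_gray.reshape(-1)
restored = flat.reshape(150, 150)

# Stacking per-image vectors gives one row per sample, the usual layout
# for a feature matrix that is later paired with a label array.
images_data = np.stack([flat, flat[::-1]])

print(flat.shape)         # shape of one flattened image
print(images_data.shape)  # shape of the stacked sample matrix
```

A shape of (22500,) therefore corresponds to a single flattened image; a list of such vectors for N images would stack to (N, 22500).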

loading dicoms with pydicom and sitk results different outputs

My problem is a bit weird. I am working on a prostate MRI dataset that contains DICOM images. When I load the DICOM files using SimpleITK, the dtype of the output NumPy array is float64, but when I load the same files using pydicom, the dtype is uint16. The problem isn't just the dtype: the pixel intensities also differ between the two modules. So my question is: why do they look different, which one is correct, and why do these modules load the data differently?
This is the code I'm using to load the dcm files.
import pydicom
import SimpleITK as sitk
path = 'dicoms/1.dcm'
def read_using_sitk():
    reader = sitk.ImageFileReader()
    reader.SetFileName(path)  # the reader needs the file name before Execute()
    image = reader.Execute()
    numpy_array = sitk.GetArrayFromImage(image)
    return numpy_array.dtype
def read_using_pydicom():
    dataset = pydicom.dcmread(path)
    numpy_array = dataset.pixel_array
    return numpy_array.dtype
The difference is that pydicom loads the original data as saved in the dataset (which is usually uint16 for MR data), while SimpleITK does some preprocessing (most likely applying the LUT) and returns the processed data as a float array.
In pydicom, to get data suitable for display, you have to apply some lookup table yourself, usually the one coming with the image.
If you have a modality LUT (not very common for MR data), you first have to apply that using apply_modality_lut, while for the VOI LUT you use apply_voi_lut. This will apply both modality and VOI LUT as found in the dataset:
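from pydicom import dcmread
# newer pydicom releases (3.x) also expose these two functions from pydicom.pixels
from pydicom.pixel_data_handlers.util import apply_modality_lut, apply_voi_lut
ds = dcmread(fname)
arr = ds.pixel_array
out = apply_modality_lut(arr, ds)
display_data = apply_voi_lut(out, ds, index=0)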
This can be safely used even if no modality or VOI LUT is present in the dataset - in this case just the input data is returned.
Note that there can be more than one VOI LUT in a DICOM image, for example to show different kinds of tissue - thus the index argument, though that is also not very common in MR images.
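To illustrate how a modality transform can change both the dtype and the values, here is a NumPy-only sketch of a linear rescale (stored value * slope + intercept). It assumes the file carries rescale slope and intercept values, which the question does not show; the numbers below are hypothetical and not taken from any real DICOM file.

```python
import numpy as np

# Stored pixel values as they would come out of pixel_array (hypothetical).
stored = np.array([[0, 100], [1000, 4095]], dtype=np.uint16)

# Hypothetical rescale parameters; real files may have none at all.
rescale_slope = 0.5
rescale_intercept = -10.0

# Linear modality transform: the result is floating point, and every
# value differs from the stored integer it came from.
rescaled = stored.astype(np.float64) * rescale_slope + rescale_intercept

print(stored.dtype, rescaled.dtype)
print(rescaled)
```

If the two libraries disagree, comparing the pydicom array after apply_modality_lut against the SimpleITK array is one way to check whether such a transform accounts for the difference.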

How to find a file/ data from a given data set in python- opencv image processing project?

I have a data set of images in an image processing project. I want to input an image and scan through the data set to recognize the given image. What module/library/approach (e.g. ML) should I use to identify my image in my Python/OpenCV code?
To find exactly the same image, you don't need any kind of ML. The image is just an array of pixels, so you can check if the array of the input image equals that of an image in your dataset.
import glob
import cv2
import numpy as np
# Read in source image (the one you want to match to others in the dataset)
source = cv2.imread('test.jpg')
# Make a list of all the images in the dataset (I assume they are images in a directory)
filelist = glob.glob(r'C:\Users\...\Images\*.JPG')
# Loop through the images, read them in and check if an image is equal to your source
for file in filelist:
    img = cv2.imread(file)
    if np.array_equal(source, img):
        print("%s is the same image as source" %(file))
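It is worth being clear about how strict this comparison is. The sketch below uses small synthetic arrays in place of images read from disk (all names and values are made up) to show that np.array_equal only reports a match when shape and every pixel agree, so a resized copy or a copy with one changed pixel does not match. The None entry stands in for a file that cv2.imread could not read, since imread returns None in that case.

```python
import numpy as np

source = np.zeros((4, 4, 3), dtype=np.uint8)

same = source.copy()
one_pixel_off = source.copy()
one_pixel_off[0, 0, 0] = 1                      # a single differing value
other_size = np.zeros((2, 2, 3), dtype=np.uint8)

candidates = {
    "same": same,
    "one_pixel_off": one_pixel_off,
    "other_size": other_size,
    "unreadable": None,                         # what cv2.imread gives on failure
}

# Skip unreadable files explicitly, then require an exact match.
matches = [
    name
    for name, img in candidates.items()
    if img is not None and np.array_equal(source, img)
]
print(matches)
```

For images that were re-encoded (for example saved again as JPEG), pixel values usually change slightly, so an exact comparison like this would not find them.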

Extracting images from matlab file

I'm trying to extract the images (and their labels and such) from an RGB-D dataset called the NYUV2 dataset. (I downloaded the labelled dataset.)
It's a MATLAB file, so I tried using h5py to read it, but I don't know how to proceed from here. How do I save the images and their corresponding labels and depths into a different folder?
Here's the script that I used and its corresponding output.
import numpy as np
import h5py
f = h5py.File('nyu_depth_v2_labeled.mat','r')
k = list(f.keys())
Output is
['#refs#', '#subsystem#', 'accelData', 'depths', 'images', 'instances', 'labels', 'names', 'namesToIds', 'rawDepthFilenames', 'rawDepths', 'rawRgbFilenames', 'sceneTypes', 'scenes']
I hope this helps.
I suppose you are using the PIL package. The function fromarray expects the mode of the image as its second argument.
I suppose your image is in RGB. I believe the image should be under group 'images' and dataset image_name
import h5py
import numpy as np
from PIL import Image
hdf = h5py.File('nyu_depth_v2_labeled.mat','r')
array = np.array(list(hdf.get("images/image_name")))
img = Image.fromarray(array.astype('uint8'), 'RGB')
You can also look at another answer I gave, "Images saved as HDF5 arent colored", to see how to save images.
To view the content of the h5 file, download HDFView; it will help you navigate through it.
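One detail that often matters with MATLAB v7.3 files read through h5py: MATLAB stores arrays column-major, so the axes appear reversed. The sketch below is built on assumptions, not on the actual file: it assumes 'images' is a plain 4-D numeric dataset that h5py sees as (N, 3, W, H) and 'depths' a 3-D one seen as (N, W, H). Check f['images'].shape on the real file first. The tiny synthetic file and its sizes are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# Tiny synthetic stand-in with the assumed axis order: (N, 3, W, H).
n_images, width, height = 2, 5, 4
mat_path = os.path.join(tempfile.mkdtemp(), "demo_v73.mat")
with h5py.File(mat_path, "w") as f:
    f.create_dataset("images",
                     data=np.zeros((n_images, 3, width, height), dtype=np.uint8))
    f.create_dataset("depths",
                     data=np.ones((n_images, width, height), dtype=np.float32))

with h5py.File(mat_path, "r") as f:
    # Read one sample at a time and reverse the axes to the usual
    # image layout: (H, W, 3) for colour, (H, W) for depth.
    rgb = np.transpose(f["images"][0], (2, 1, 0))
    depth = np.transpose(f["depths"][0], (1, 0))

print(rgb.shape, depth.shape)
```

An array in (H, W, 3) uint8 layout is what Image.fromarray expects for an RGB image, so each rgb array could then be saved to the target folder with PIL.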

h5py: How to index over multiple large HDF5 files without loading all their content into memory

This is a question about working with multiple HDF5 datasets simultaneously while treating them as one dataset as far as possible.
I have multiple .h5 files, each of which contains tens of thousands of images. Let's call the files file01.h5, file02.h5, and file03.h5.
I now want to create a list or array that contains "pointers" to all images of all three files, without actually loading the images.
Here is what I have so far:
I first open all files:
file01 = h5py.File('file01.h5', 'r')
file02 = h5py.File('file02.h5', 'r')
file03 = h5py.File('file03.h5', 'r')
and add their image datasets to a list:
images = []
images.append(file01['images'])
images.append(file02['images'])
images.append(file03['images'])
where file01['images'] is an HDF5 dataset of shape e.g. (52722, 3, 160, 320), i.e. 52722 images. All good so far, none of the content has been loaded into memory yet. Now I want to make these three separate image lists into one so that I can work with it as if it were one large dataset. I tried to do this:
images = np.concatenate(images)
This is where it breaks. As soon as I concatenate the three HDF5 datasets, it actually loads the images as Numpy arrays and I run out of memory.
What would be the best way to solve this?
I need a solution that allows me to Numpy-slice and index into the three datasets as if it were one.
For example, assume each dataset contained 50,000 images and I wanted to load the third image of each dataset, I need a list images that allows me to index those images as
batch = images[[2, 50002, 100002]]
HDF5 has introduced the concept of a "Virtual Dataset" (VDS).
However, this does not work for HDF5 versions before 1.10.
I have no experience with the VDS feature, but the h5py docs go into more detail and the h5py git repository has an example file here:
'''A simple example of building a virtual dataset.
This makes four 'source' HDF5 files, each with a 1D dataset of 100 numbers.
Then it makes a single 4x100 virtual dataset in a separate file, exposing
the four sources as one dataset.
'''
import h5py
import numpy as np
# Create source files (1.h5 to 4.h5)
for n in range(1, 5):
    with h5py.File('{}.h5'.format(n), 'w') as f:
        d = f.create_dataset('data', (100,), 'i4')
        d[:] = np.arange(100) + n
# Assemble virtual dataset
layout = h5py.VirtualLayout(shape=(4, 100), dtype='i4')
for n in range(1, 5):
    filename = "{}.h5".format(n)
    vsource = h5py.VirtualSource(filename, 'data', shape=(100,))
    layout[n - 1] = vsource
# Add virtual dataset to output file
with h5py.File("VDS.h5", 'w', libver='latest') as f:
    f.create_virtual_dataset('data', layout, fillvalue=-5)
    print("Virtual dataset:")
    print(f['data'][:, :10])
More details can be found on the HDF Group website, which links to a PDF. Figure 1 illustrates the idea nicely.
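Adapted to the question, the same mechanism can place the source files one after another along the first axis instead of stacking them as rows. This is an untested-on-real-data sketch: the tiny image shape, the per-file counts and the output file name all_images.h5 are hypothetical, and it assumes every file has an 'images' dataset with the same dtype and per-image shape.

```python
import os
import tempfile

import h5py
import numpy as np

work_dir = tempfile.mkdtemp()
image_shape = (3, 4, 5)   # small stand-in for (3, 160, 320)
counts = [5, 5, 5]        # images per source file (hypothetical)

# Create three small source files, each with an 'images' dataset.
sources = []
for k, n in enumerate(counts):
    path = os.path.join(work_dir, f"file{k + 1:02d}.h5")
    with h5py.File(path, "w") as f:
        f.create_dataset("images",
                         data=np.full((n, *image_shape), k, dtype="u1"))
    sources.append(path)

# Map each source onto a consecutive block along axis 0 of the layout.
layout = h5py.VirtualLayout(shape=(sum(counts), *image_shape), dtype="u1")
start = 0
for path, n in zip(sources, counts):
    layout[start:start + n] = h5py.VirtualSource(
        path, "images", shape=(n, *image_shape))
    start += n

vds_path = os.path.join(work_dir, "all_images.h5")
with h5py.File(vds_path, "w", libver="latest") as f:
    f.create_virtual_dataset("images", layout, fillvalue=0)

with h5py.File(vds_path, "r") as f:
    images = f["images"]          # a dataset handle, no pixel data read yet
    total_shape = images.shape
    # Third image of each source file, read one index at a time.
    batch = np.stack([images[i] for i in (2, 7, 12)])

print(total_shape, batch.shape)
```

Only the requested images are read from the underlying files, so memory use depends on the batch size rather than on the total number of images.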
