Caffe: Converting CSV file to HDF5 - python

I have learned a little about the Caffe framework (which is used to define and train deep learning models).
As my first project, I wanted to write a program for training and testing a "Face Emotion Recognition" task using the fer2013 dataset.
The dataset I have downloaded is in CSV format. As far as I know, for working with Caffe the dataset has to be in either "lmdb" or "hdf5" format.
So it seems that the first thing I have to do is convert my dataset into the hdf5 or lmdb format.
Here is a simple code I tried at first:
import pandas as pd
import numpy as np
import csv
csvFile = pd.HDFStore('PrivateTest.csv')
PrivateTestHDF5 = csvFile.to_hdf(csvFile)
print len(PrivateTestHDF5)
But it doesn't work, and I get this error:
" Unable to open/create file 'PrivateTest.csv "
I have searched alot, I found this link but I still can not understand how does it read from a CSV file.
Also I do not have installed Matlab.
I would be happy if anyone can help me on this. Also if any advice about writing caffe models for datasets that are on Kaggle website or any other dataset ( Those who are not on caffe website )

Your input data doesn't have to be in lmdb or hdf5; you can also feed Caffe a plain list of image files. All you have to do is use an ImageData input layer such as this one:
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: false
    crop_size: 224
    mean_file: "./supporting_files/mean.binaryproto"
  }
  image_data_param {
    source: "./supporting_files/labels_train.txt"
    batch_size: 64
    shuffle: true
    new_height: 339
    new_width: 339
  }
}
Here, the file "./supporting_files/labels_train.txt" is just a plain text file that lists the paths of the input images (stored on the file system as regular image files), one "path label" pair per line.
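For example, a minimal sketch that builds such a listing from a hypothetical labels.csv with "filename" and "label" columns (both column names and the paths are assumptions, not part of the original answer):
import pandas as pd

# hypothetical CSV with one image file name and one integer label per row
df = pd.read_csv('labels.csv')

# Caffe's ImageData layer expects one "path label" pair per line
with open('./supporting_files/labels_train.txt', 'w') as f:
    for _, row in df.iterrows():
        f.write('{} {}\n'.format(row['filename'], int(row['label'])))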
This is usually the simplest way to provide data to the model. But if you really have to use an HDF5 file, you can use something like this function:
import h5py
import numpy as np
import caffe

def create_h5_file(labels, file_name, width, height, nr_labels_per_image):
    # width, height and nr_labels_per_image are passed in explicitly here
    # so that the snippet is self-contained
    nr_entries = len(labels)
    images = np.zeros((nr_entries, 3, height, width), dtype='f4')
    image_labels = np.zeros((nr_entries, nr_labels_per_image), dtype='f4')
    for i, l in enumerate(labels):
        # load_image returns an H x W x 3 float array with values in [0, 1]
        img = caffe.io.load_image(l[0])
        # pre-process and/or augment your data here
        # (images are assumed to already be height x width pixels)
        images[i] = img.transpose(2, 0, 1)  # HxWxC -> CxHxW, as Caffe expects
        image_labels[i] = [int(x) for x in l[1]]
    with h5py.File(file_name, "w") as H:
        H.create_dataset("data", data=images)
        H.create_dataset("label", data=image_labels)
where file_name is a string with the path of the HDF5 output file, labels is an array of tuples such as ("/path/to/my/image", ["label1", "label2", ..., "labeln"]), and width, height and nr_labels_per_image describe the image size and the number of labels per image.
Notice that this function works for datasets with multiple labels per image (one valid reason for using hdf5 instead of a csv file), but you probably only need a single label per image.
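As a usage sketch (the file names, image size and single-label setup below are assumptions, not part of the original answer), you could call the function like this and then point Caffe's HDF5Data layer at a small text file listing the generated .h5 files:
# hypothetical list of (image_path, [labels]) tuples
labels = [
    ('/data/fer2013/img_00001.png', ['0']),
    ('/data/fer2013/img_00002.png', ['3']),
]

# assuming 48x48 images with a single label each
create_h5_file(labels, 'train.h5', width=48, height=48, nr_labels_per_image=1)

# the HDF5Data layer's "source" is a text file listing the .h5 files
with open('train_h5_list.txt', 'w') as f:
    f.write('train.h5\n')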

A bit late, but I wanted to point out that if the CSV file is too big to load into memory, you can use pandas' chunksize to split the file and load the chunks one by one into HDF5:
import pandas as pd

csvfile = 'yourCSVfile.csv'
hdf5File = 'yourh5File.h5'
tp = pd.read_csv(csvfile, chunksize=100000)
for chunk in tp:
    chunk.to_hdf(hdf5File, key='data', mode='a', format='table', append=True)
Note that append=True requires format='table'.
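If needed, the stored table can also be read back chunk by chunk rather than all at once; a small sketch reusing the hdf5File name from above (process is a placeholder for your own per-chunk work):
import pandas as pd

hdf5File = 'yourh5File.h5'

# iterate over the stored table in chunks instead of loading it all at once
for chunk in pd.read_hdf(hdf5File, key='data', chunksize=100000):
    process(chunk)  # placeholder: replace with your own per-chunk processing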

Related

How to save/extract a dataset from HDF5 and convert it into TIFF?

I am trying to import CT scan data into ImageJ/Fiji (there is an HDF5 plugin for ImageJ/Fiji; however, the synchrotron CT datasets are so large that it failed to open them). The scan data (the image dataset) is saved as a dataset inside an HDF5 file. So I have to extract the image dataset from the HDF5 file and then convert it into TIFF files.
The HDF5 file path is "F:/New_ESRF/SNT_BTO4/SNT_BTO4_S1/SNT_BTO4_S1_1_1pag_db0005_vol.hdf5"
Herein, 'SNT_BTO4_S1_1_1pag_db0005_vol.hdf5' is divided into several datasets, and the image dataset is here: /entry0000/reconstruction/results/data
At the moment, I have accessed the image dataset using h5py. However, I am now stuck on how to extract/save the dataset separately from the HDF5 file.
What code is required to extract the image dataset from the HDF5 file?
After that, I am thinking of using PIL's Image to convert the images into TIFF files. Can I get any advice on the code for this?
import numpy as np
import h5py

filename = "F:/New_ESRF/SNT_BTO4/SNT_BTO4_S1/SNT_BTO4_S1_1_1pag_db0005_vol.hdf5"

with h5py.File(filename, 'r') as hdf:
    base_items = list(hdf.items())
    print('#Items in the base directory:', base_items)
    # entry0000
    G1 = hdf.get('entry0000')
    G1_items = list(G1.items())
    print('#Items in entry0000', G1_items)
    # reconstruction
    G11 = G1.get('/entry0000/reconstruction')
    G11_items = list(G11.items())
    print('#Items in reconstruction', G11_items)
    # results_data
    G12 = G11.get('/entry0000/reconstruction/results')
    G12_items = list(G12.items())
    print('#Items in results', G12_items)
Extracting image data from an HDF5 file and converting it to an image is a relatively straightforward 2-step process:
Access the data in the HDF5 file
Convert to an image with cv2 (or PIL)
A simple example is available here: How to extract individual JPEG images from a HDF5 file.
You can apply the same process to your file. Here is some pseudo-code. It's not complete because you don't show the shape of the image dataset (and the shape affects how to read the data). Also, you didn't say how many images are in dataset /entry0000/reconstruction/results/data -- does it have a single image or multiple images? If multiple images, which axis is the image counter?
import h5py
import cv2  # for image conversion

filename = "F:/New_ESRF/SNT_BTO4/SNT_BTO4_S1/SNT_BTO4_S1_1_1pag_db0005_vol.hdf5"

with h5py.File(filename, 'r') as hdf:
    # get image dataset
    img_ds = hdf['/entry0000/reconstruction/results/data']
    print(f'Image Dataset info: Shape={img_ds.shape}, Dtype={img_ds.dtype}')
    ## following depends on dataset shape/schema
    ## code below assumes images are along axis=0
    for i in range(img_ds.shape[0]):
        cv2.imwrite(f'test_img_{i:03}.tiff', img_ds[i, :])  # uses slice notation
        # alternately, load to a numpy array first
        img_arr = img_ds[i, :]  # slice notation gets [i,:,:,:]
        cv2.imwrite(f'test_img_{i:03}.tiff', img_arr)
Note: you don't need to use .get() to get a dataset; you can simply reference the dataset path. Also, when you start from a group object, use the path relative to that group, not the absolute path. (You should modify your code to reflect these changes.) For example, the following are equivalent:
G1 = hdf['entry0000']
## is the same as G1 = hdf.get('entry0000')
G11 = hdf['entry0000/reconstruction']
## is the same as G11 = hdf.get('entry0000/reconstruction')
## OR referencing G1 group object:
G11 = G1['reconstruction']
## is the same as G11 = G1.get('reconstruction')
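Since the question also mentions PIL, here is a minimal Pillow sketch for the same write step (an assumption on my part: each img_arr slice is treated as a 2-D array and rescaled to 8-bit for simplicity, which discards the original bit depth):
import numpy as np
from PIL import Image

def save_slice_as_tiff(img_arr, out_path):
    # rescale the slice to 0-255 and convert to uint8
    # (keep the original dtype instead if you need full precision)
    arr = img_arr.astype(np.float64)
    arr = (arr - arr.min()) / max(arr.max() - arr.min(), 1e-12) * 255.0
    Image.fromarray(arr.astype(np.uint8)).save(out_path)

# e.g. inside the loop above:
# save_slice_as_tiff(img_arr, f'test_img_{i:03}.tiff')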

How to read images using csv in python

I have a CSV file that contains three columns, namely image file names, labels (in the form of 0, 1, ...), and class names, while another folder contains all the images. I want to read the images using this CSV file and use them further for an image classification task with deep learning models in Python.
You can try this:
import pandas as pd
import cv2
import numpy as np
from pathlib import Path

df = pd.read_csv('path_to_csv.csv')
path = Path('path/to/img/dir')  # relative path is OK

# assumes the first column holds the file names and the second the labels
dataset = df.apply(
    lambda row: (cv2.imread(str(path / row.iloc[0])), row.iloc[1]),
    axis=1)
dataset = np.asarray(dataset)
Now you'll have dataset, a NumPy object array in which each entry is an (image, label) pair.
I'm using OpenCV to read the images. Of course you can use others like Pillow too, but OpenCV is faster.
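If you then need separate arrays for a training loop, a small follow-up sketch (assuming every image has the same shape; this is an addition, not part of the original answer):
import numpy as np

# split the (image, label) pairs into two aligned arrays
images = np.stack([img for img, _ in dataset])      # shape: (N, H, W, 3)
labels = np.array([label for _, label in dataset])  # shape: (N,)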

How to implement multi-threaded import of numpy arrays stored on disk as dataset in Tensorflow

The inputs and labels of my dataset are stored in 10000 .npy files each. For example inputs/0000.npy,...inputs/9999.npy and labels/0000.npy,...labels/9999.npy. While each file can be held in memory on its own, the whole dataset of 20k arrays cannot. I would like to implement a multi-threaded CPU pipeline to import the dataset in batches of, say, batch_size=8.
I have tried to implement the functions mentioned in the new Tensorflow data API but haven't found any example for my requirements. All examples seem to be for cases where the whole dataset can be loaded into RAM. Any idea how to approach this?
I would use tf.data.Dataset.from_generator(), which allows you to use the Tensorflow data API through a custom Python generator function. This way, you can load each .npy file iteratively, having only one numpy.ndarray loaded in memory at once. Assuming that each loaded numpy.ndarray is a single instance, example code for your case might look something like the following:
import tensorflow as tf
import numpy as np
import os

def gen():
    inputs_path = ""  # e.g. "inputs/"
    labels_path = ""  # e.g. "labels/"
    # sort both listings so inputs and labels stay aligned
    for input_file, label_file in zip(sorted(os.listdir(inputs_path)),
                                      sorted(os.listdir(labels_path))):
        x = np.load(os.path.join(inputs_path, input_file))
        y = np.load(os.path.join(labels_path, label_file))
        yield x, y

INPUT_SHAPE = []  # shape of a single input array
LABEL_SHAPE = []  # shape of a single label array

# Input pipeline
ds = tf.data.Dataset.from_generator(
    gen, (tf.float32, tf.int64),
    (tf.TensorShape(INPUT_SHAPE), tf.TensorShape(LABEL_SHAPE)))
ds = ds.batch(8)
ds_iter = ds.make_initializable_iterator()
inputs_batch, labels_batch = ds_iter.get_next()
I have not tested the code. Hope it helps!
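One small, hedged addition (not part of the original answer): since the question is about keeping the CPU input pipeline busy while training, you can also let tf.data prefetch batches in the background after the batch call above:
# prefetch a couple of batches so the generator keeps producing data
# in a background thread while the model trains on the current batch
ds = ds.prefetch(2)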

h5py: How to index over multiple large HDF5 files without loading all their content into memory

This is a question about working with multiple HDF5 datasets simultaneously while treating them as one dataset as far as possible.
I have multiple .h5 files, each of which contains tens of thousands of images. Let's call the files
file01.h5
file02.h5
file03.h5
I now want to create a list or array that contains "pointers" to all images of all three files, without actually loading the images.
Here is what I have so far:
I first open all files:
file01 = h5py.File('file01.h5', 'r')
file02 = h5py.File('file02.h5', 'r')
file03 = h5py.File('file03.h5', 'r')
and add their image datasets to a list:
images = []
images.append(file01['images'])
images.append(file02['images'])
images.append(file03['images'])
where file01['images'] is an HDF5 dataset of shape e.g. (52722, 3, 160, 320), i.e. 52722 images. All good so far; none of the content has been loaded into memory yet. Now I want to merge these three separate image lists into one so that I can work with it as if it were one large dataset. I tried to do this:
images = np.concatenate(images)
This is where it breaks. As soon as I concatenate the three HDF5 datasets, it actually loads the images as Numpy arrays and I run out of memory.
What would be the best way to solve this?
I need a solution that allows me to Numpy-slice and index into the three datasets as if it were one.
For example, assume each dataset contained 50,000 images and I wanted to load the third image of each dataset, I need a list images that allows me to index those images as
batch = images[[2, 50002, 100002]]
HDF5 introduced the concept of a "Virtual Dataset" (VDS) in version 1.10; it does not work with earlier versions.
I have no experience with the VDS feature, but the h5py docs go into more detail, and the h5py git repository has an example file here:
'''A simple example of building a virtual dataset.
This makes four 'source' HDF5 files, each with a 1D dataset of 100 numbers.
Then it makes a single 4x100 virtual dataset in a separate file, exposing
the four sources as one dataset.
'''
import h5py
import numpy as np

# Create source files (1.h5 to 4.h5)
for n in range(1, 5):
    with h5py.File('{}.h5'.format(n), 'w') as f:
        d = f.create_dataset('data', (100,), 'i4')
        d[:] = np.arange(100) + n

# Assemble virtual dataset
layout = h5py.VirtualLayout(shape=(4, 100), dtype='i4')
for n in range(1, 5):
    filename = "{}.h5".format(n)
    vsource = h5py.VirtualSource(filename, 'data', shape=(100,))
    layout[n - 1] = vsource

# Add virtual dataset to output file
with h5py.File("VDS.h5", 'w', libver='latest') as f:
    f.create_virtual_dataset('data', layout, fillvalue=-5)
    print("Virtual dataset:")
    print(f['data'][:, :10])
More details can be found on the HDF Group website, which links to a PDF; Figure 1 there illustrates the idea nicely.
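Applied to the setup in the question, a hedged sketch might stack the three 'images' datasets along axis 0 (the shape (N, 3, 160, 320) and the dataset name come from the question; the 'f4' dtype and the output file name are assumptions):
import h5py

files = ['file01.h5', 'file02.h5', 'file03.h5']

# read each file's image count without loading any image data
counts = []
for fn in files:
    with h5py.File(fn, 'r') as f:
        counts.append(f['images'].shape[0])

layout = h5py.VirtualLayout(shape=(sum(counts), 3, 160, 320), dtype='f4')  # dtype is an assumption

offset = 0
for fn, n in zip(files, counts):
    layout[offset:offset + n] = h5py.VirtualSource(fn, 'images', shape=(n, 3, 160, 320))
    offset += n

with h5py.File('all_images.h5', 'w', libver='latest') as f:
    f.create_virtual_dataset('images', layout)

# the virtual dataset can now be indexed as if it were one array
with h5py.File('all_images.h5', 'r') as f:
    batch = f['images'][[2, 50002, 100002]]  # pulls only these three images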

Create a data set of images in batch files

Having searched the internet, I couldn't find how a set of images with their corresponding labels can be saved as batch files in such a way that, using the following code, I'll be able to retrieve (load) them and work on them. Simply put, I'm searching for a way to save them so that the following code can retrieve them.
import cPickle as pickle
import numpy as np
import os

def load_CIFAR_batch(filename):
    """ load single batch of cifar """
    with open(filename, 'rb') as f:
        datadict = pickle.load(f)
        X = datadict['data']
        Y = datadict['labels']
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(Y)
        return X, Y
For example, the CIFAR-10 dataset is already in this batch format, created with cPickle, but I don't know how to use cPickle the way they did to save the images.
CIFAR-10 Dataset link : http://www.cs.toronto.edu/~kriz/cifar.html
I'm using:
Ubuntu 14.04 LTS
Python 2.7
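For reference, a minimal, hedged sketch of the inverse of the loader above (assuming the images already sit in a uint8 array of shape (10000, 32, 32, 3) and the labels in a list; save_batch and the example file name are placeholders, not an official CIFAR tool):
import cPickle as pickle
import numpy as np

def save_batch(images, labels, filename):
    # images: (10000, 32, 32, 3) uint8 array, labels: list/array of ints
    # reverse the transpose/reshape performed in load_CIFAR_batch
    data = images.transpose(0, 3, 1, 2).reshape(len(images), -1).astype(np.uint8)
    datadict = {'data': data, 'labels': list(labels)}
    with open(filename, 'wb') as f:
        pickle.dump(datadict, f, protocol=pickle.HIGHEST_PROTOCOL)

# e.g.:
# save_batch(my_images, my_labels, 'data_batch_1')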
