Create a data set of images in batch files - python

Having searched the internet, I couldn't find out how a set of images with their corresponding labels can be saved as batch files so that I can later retrieve (load) them and work on them. Simply put, I'm looking for a way to save them such that the following code can retrieve them:
import cPickle as pickle
import numpy as np
import os

def load_CIFAR_batch(filename):
    """ load single batch of cifar """
    with open(filename, 'rb') as f:
        datadict = pickle.load(f)
        X = datadict['data']
        Y = datadict['labels']
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(Y)
        return X, Y
For example, the CIFAR-10 dataset is already distributed in batch format using cPickle, but I don't know what they did with cPickle to save the images.
CIFAR-10 dataset link: http://www.cs.toronto.edu/~kriz/cifar.html
I'm using:
Ubuntu 14.04 LTS
Python 2.7
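
As a rough sketch (not from CIFAR-10 itself), assuming your images are RGB files on disk and each batch holds exactly 10000 of them to match the hard-coded reshape above, you can build the same dict structure and pickle it yourself; save_batch and its arguments are hypothetical names:

import cPickle as pickle
import numpy as np
from PIL import Image

def save_batch(image_paths, labels, filename):
    # Flatten each 32x32 RGB image into a 3072-length row,
    # channel-major (R plane, G plane, B plane) like CIFAR-10.
    rows = []
    for p in image_paths:
        img = np.array(Image.open(p).convert('RGB').resize((32, 32)), dtype=np.uint8)
        rows.append(img.transpose(2, 0, 1).reshape(-1))
    datadict = {'data': np.vstack(rows), 'labels': list(labels)}
    with open(filename, 'wb') as f:
        pickle.dump(datadict, f, protocol=pickle.HIGHEST_PROTOCOL)

A file written this way can then be read back with the load_CIFAR_batch function above.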

Related

Create a custom dataset from a folder with separate files

I am using Pytorch's custom dataset feature to create a custom dataset from separate files in one folder. Each file contains 123 rows and 123 columns, and all data points are integers.
My issue is that the resources I've come across assume all the data is in a single .csv file, while mine is not. Moreover, opening the files after transforming them into images doesn't work either. I'm not sure how to proceed, as my code gives:
AttributeError: 'Image' object has no attribute 'read'
import os
from torch.utils.data import DataLoader, Dataset
from numpy import genfromtxt

# Custom dataset
class CONCEPTDataset(Dataset):
    """ Concept Dataset """

    def __init__(self, file_dir, transforms=None):
        """
        Args:
            file_dir (string): Directory with all the images.
            transforms (optional): Changes on the data.
        """
        self.file_dir = file_dir
        self.transforms = transforms
        self.concepts = os.listdir(file_dir)
        self.concepts.sort()
        self.concepts = [os.path.join(file_dir, concept) for concept in self.concepts]

    def __len__(self):
        return len(self.concepts)

    def __getitem__(self, idx):
        image = self.concepts[idx]
        # csv file to a numpy array using genfromtxt
        data = genfromtxt(image, delimiter=',')
        data = self.transforms(data.unsqueeze(0))
        return data
PIL.Image.fromarray is used to convert an array to a PIL Image, while Image.open is used to load an image file from the file system. You don't need either of those two, since you already have a NumPy array representing your image and are looking to return it. PyTorch will convert it to a torch.Tensor automatically if you plug your dataset into a torch.utils.data.DataLoader.
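
A minimal sketch of __getitem__ along those lines, assuming the transforms (if any) expect a torch.Tensor and that you still want a singleton channel dimension:

import torch
from numpy import genfromtxt

    def __getitem__(self, idx):
        path = self.concepts[idx]
        # csv file -> numpy array -> float tensor
        data = torch.from_numpy(genfromtxt(path, delimiter=',')).float()
        data = data.unsqueeze(0)   # add a channel dimension: (1, 123, 123)
        if self.transforms is not None:
            data = self.transforms(data)
        return data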

How to load an image dataset in scikit-learn?

I have collected a group of images that I want to train a model on.
How do I load the image dataset? I have a folder of training data with two folders in it denoting the two different kinds of objects. How would I go about loading this data set and then training a model?
This might help you load your dataset into a data variable from a single folder of images:
import cv2
import os
import numpy as np

path = 'path to your dataset'
list_of_files = os.listdir(path)

data = []
for i in list_of_files:
    # read each image and collect it in a list
    x = cv2.imread(os.path.join(path, i))
    data.append(x)
# stack into one array (assumes all images have the same dimensions)
data = np.array(data)
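
Since the question mentions a training folder with two class subfolders, here is a hedged sketch along the same lines that also builds a label array and fits a simple scikit-learn classifier; the folder names class_a and class_b and the 64x64 size are placeholders:

import os
import cv2
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

root = 'path to your training data'   # contains class_a/ and class_b/
images, labels = [], []
for label, class_dir in enumerate(['class_a', 'class_b']):
    class_path = os.path.join(root, class_dir)
    for fname in os.listdir(class_path):
        img = cv2.imread(os.path.join(class_path, fname))
        if img is None:
            continue                       # skip non-image files
        img = cv2.resize(img, (64, 64))    # same size so rows can be stacked
        images.append(img.flatten())       # flatten to a feature vector
        labels.append(label)

X = np.array(images, dtype=np.float32) / 255.0
y = np.array(labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = SVC().fit(X_train, y_train)
print(clf.score(X_test, y_test))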

How to implement multi-threaded import of numpy arrays stored on disk as dataset in Tensorflow

The inputs and labels of my dataset are stored in 10000 .npy files each, for example inputs/0000.npy, ..., inputs/9999.npy and labels/0000.npy, ..., labels/9999.npy. While each file on its own fits in memory, the whole dataset of 20k arrays does not. I would like to implement a multi-threaded CPU pipeline to import the dataset in batches of, say, batch_size=8.
I have tried to implement the functions mentioned in the new Tensorflow data API but haven't found any example for my requirements. All examples seem to be for cases where the whole dataset can be loaded into RAM. Any idea how to approach this?
I would use tf.data.Dataset.from_generator() which allows you to use Tensorflow data API through a custom python generator function. This way, you can load each .npy file iteratively, having only one numpy.ndarray loaded in memory at once. Assuming that each loaded numpy.ndarray is a single instance, an example code for your case might be something as following:
import tensorflow as tf
import numpy as np
import os

def gen():
    inputs_path = ""
    labels_path = ""
    for input_file, label_file in zip(os.listdir(inputs_path), os.listdir(labels_path)):
        x = np.load(os.path.join(inputs_path, input_file))
        y = np.load(os.path.join(labels_path, label_file))
        yield x, y

INPUT_SHAPE = []
LABEL_SHAPE = []

# Input pipeline
ds = tf.data.Dataset.from_generator(
    gen, (tf.float32, tf.int64), (tf.TensorShape(INPUT_SHAPE), tf.TensorShape(LABEL_SHAPE)))
ds = ds.batch(8)
ds_iter = ds.make_initializable_iterator()
inputs_batch, labels_batch = ds_iter.get_next()
I have not tested the code. Hope it helps!
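
To actually pull batches from that pipeline (TensorFlow 1.x style, matching make_initializable_iterator above), the usage would look roughly like this:

with tf.Session() as sess:
    sess.run(ds_iter.initializer)
    while True:
        try:
            x_batch, y_batch = sess.run([inputs_batch, labels_batch])
            # ... run your training step on x_batch / y_batch here
        except tf.errors.OutOfRangeError:
            break   # generator exhausted: one full pass over the .npy files

You may also want to sort the two os.listdir() results so that input and label files line up deterministically.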

Caffe: Converting CSV file to HDF5

I have learned a little about the Caffe framework (which is used to define and train deep learning models).
As my first program, I wanted to write a program for training and testing "Face Emotion Recognition" using the fer2013 dataset.
The dataset I downloaded is in CSV format. As far as I know, for working with Caffe the dataset has to be in either lmdb or hdf5 format.
So it seems that the first thing I have to do is convert my dataset to hdf5 or lmdb.
Here is a simple code I tried at first:
import pandas as pd
import numpy as np
import csv
csvFile = pd.HDFStore('PrivateTest.csv')
PrivateTestHDF5 = csvFile.to_hdf(csvFile)
print len(PrivateTestHDF5)
But it doesn't work, and I get this error:
" Unable to open/create file 'PrivateTest.csv "
I have searched alot, I found this link but I still can not understand how does it read from a CSV file.
Also I do not have installed Matlab.
I would be happy if anyone can help me on this. Also if any advice about writing caffe models for datasets that are on Kaggle website or any other dataset ( Those who are not on caffe website )
Your input data doesn't have to be in lmdb or hdf5. You can input data from a csv file. All you have to do is to use an ImageData input layer such as this one:
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: false
    crop_size: 224
    mean_file: "./supporting_files/mean.binaryproto"
  }
  image_data_param {
    source: "./supporting_files/labels_train.txt"
    batch_size: 64
    shuffle: true
    new_height: 339
    new_width: 339
  }
}
Here, the file "./supporting_files/labels_train.txt" is just a csv file that contains the paths to the input images stored on the file system as regular images.
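For reference, a listing file for an ImageData layer typically has one image path and one integer label per line, separated by whitespace; the paths below are made-up examples:

images/train/img_0001.jpg 0
images/train/img_0002.jpg 3
images/train/img_0003.jpg 1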
This is usually the simplest way to provide data to the model. But if you really have to use HDF5 file you can use something like this function:
import h5py
import sys
import numpy as np
import caffe

def create_h5_file(labels, file_name):
    # width, height and nr_labels_per_image are assumed to be defined elsewhere
    nr_entries = len(labels)
    images = np.zeros((nr_entries, 3, width, height), dtype='f4')
    image_labels = np.zeros((nr_entries, nr_labels_per_image), dtype='f4')
    for i, l in enumerate(labels):
        img = caffe.io.load_image(l[0])
        # pre process and/or augment your data
        images[i] = img
        image_labels[i] = [int(x) for x in l[1]]
    with h5py.File(file_name, "w") as H:
        H.create_dataset("data", data=images)
        H.create_dataset("label", data=image_labels)
where file_name is a string with the path of the hdf5 output file, and labels is an array of tuples such as ("/path/to/my/image", ["label1","label2",...,"labeln"]).
Notice that this function works for datasets with multiple labels per image (one valid reason for using hdf5 instead of a csv file), but you probably only need a single label per image.
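
A short usage sketch for that function (the file names and label values here are made up); if you go the HDF5 route, Caffe's HDF5Data layer expects its source parameter to point at a text file that lists the .h5 files:

# hypothetical example inputs: one image path paired with its label list;
# width, height and nr_labels_per_image must be defined as in the function above
labels = [("/data/fer2013/img_0.png", ["3"]),
          ("/data/fer2013/img_1.png", ["5"])]
create_h5_file(labels, "train.h5")

# list file consumed by an HDF5Data layer's `source` parameter
with open("train_h5_list.txt", "w") as f:
    f.write("train.h5\n")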
A bit late, but wanted to point out that if the csv file is too big to load into memory you can use pandas "chunksize" to split the file and load the chunks one by one to HDF5:
import pandas as pd

csvfile = 'yourCSVfile.csv'
hdf5File = 'yourh5File.h5'

tp = pd.read_csv(csvfile, chunksize=100000)
for chunk in tp:
    chunk.to_hdf(hdf5File, key='data', mode='a', format='table', append=True)
Note that append=True requires format='table'.
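
To read the data back later in pieces (for example, to verify the conversion without loading the whole table), the table format also supports chunked reads; a small sketch:

import pandas as pd

for chunk in pd.read_hdf('yourh5File.h5', key='data', chunksize=100000):
    print(chunk.shape)   # process each chunk independently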

How to store a python ndarray on disk?

I have a pkl file containing an ndarray that I originally dumped using a GPU. I unpickle it with the GPU and now I want to store it in some format that I can later load using a CPU. I run everything on a supercomputer, and later I just want to have access to the ndarrays on a normal computer without a fancy GPU. I looked into functions such as
np.save()
np.savez()
but with save() I can't set allow_pickle=False, and when I load the array stored with savez() it is empty.
This is how I save things:
I run THEANO_FLAGS="device=gpu,floatX=float32" srun -u python deep_q_rl/unpicklestuff.py
unpicklestuff.py:
import sys
import cPickle
import lasagne.layers
import os
import numpy as np

for i in os.listdir(path):
    net_file = open(path + str(i), 'r')
    network = cPickle.load(net_file)
    q_layers = lasagne.layers.get_all_layers(network.l_out)
    np.savez(savepath + str(i), q_layers)
And this is how I load them later:
q_layers = np.load(path)
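
One likely reason the loaded result looks empty is that np.load on an .npz archive returns a lazy NpzFile that has to be indexed by key (arr_0, arr_1, ... for positional arguments to savez). Also, get_all_layers returns Layer objects rather than plain arrays, so a common approach is to save the numeric parameter values instead via lasagne.layers.get_all_param_values. A sketch, reusing the variable names from the code above:

import numpy as np
import lasagne.layers

# inside the loop: save the parameter arrays, not the Layer objects
param_values = lasagne.layers.get_all_param_values(network.l_out)
np.savez(savepath + str(i), *param_values)   # stored as arr_0, arr_1, ...

# later, on the CPU-only machine (note the .npz extension savez adds):
archive = np.load(savepath + str(i) + '.npz')
loaded = [archive['arr_%d' % j] for j in range(len(archive.files))]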
