I have a very large database of images stored locally, with the data distributed so that each folder contains the images of one class.
I would like to use the TensorFlow Dataset API to obtain batches of data without having all the images loaded in memory.
I have tried something like this:
def _parse_function(filename, label):
    """Read the JPEG at `filename` and return it as a float32 tensor with its label."""
    raw_bytes = tf.read_file(filename, "file_reader")
    decoded = tf.image.decode_jpeg(raw_bytes, channels=3)
    return tf.cast(decoded, tf.float32), label
# Build the input pipeline: shuffle file paths, decode each image with
# _parse_function, and only then batch.
# Fix: the original batched BEFORE mapping, which fed a rank-1 batch of
# filenames into tf.read_file (a scalar-string op) and produced
# "Shape must be rank 0 but is rank 1 for 'file_reader'".
image_list, label_list, label_map_dict = read_data()
dataset = tf.data.Dataset.from_tensor_slices((tf.constant(image_list), tf.constant(label_list)))
dataset = dataset.shuffle(len(image_list))
dataset = dataset.repeat(epochs)
dataset = dataset.map(_parse_function).batch(batch_size)
iterator = dataset.make_one_shot_iterator()
image_list is a list where the path (and name) of the images have been appended and label_list is a list where the class of each image has been appended in the same order.
But the _parse_function does not work; the error that I receive is:
ValueError: Shape must be rank 0 but is rank 1 for 'file_reader' (op: 'ReadFile') with input shapes: [?].
I have googled the error, but nothing works for me.
If I do not use the map function, I just receive the paths of the images (which are stored in image_list), so I think that I need the map function to read the images, but I am not able to make it work.
Thank you in advance.
EDIT:
def read_data(base_dir=None):
    """Scan a one-folder-per-class image tree and build path/label lists.

    Args:
        base_dir: Root directory to scan.  Defaults to the module-level
            `base_path` for backward compatibility with the original no-arg call.

    Returns:
        (image_list, label_list, label_map_dict) where image_list holds the
        full image paths, label_list the integer class of each image (same
        order), and label_map_dict maps class-folder name -> integer label.

    Fixes over the original: the function previously returned nothing although
    the caller unpacks three values; `count_label` is now advanced once per
    class via enumerate(); directory entries are sorted because os.listdir
    returns them in arbitrary order, making the labels deterministic.
    """
    root = base_path if base_dir is None else base_dir
    image_list = []
    label_list = []
    label_map_dict = {}
    for count_label, class_name in enumerate(sorted(os.listdir(root))):
        class_path = os.path.join(root, class_name)
        label_map_dict[class_name] = count_label
        for image_name in sorted(os.listdir(class_path)):
            image_list.append(os.path.join(class_path, image_name))
            label_list.append(count_label)
    return image_list, label_list, label_map_dict
The error is in this line: dataset = dataset.repeat(epochs).batch(batch_size). Your pipeline adds batch_size as a dimension to the input, so the map function receives a batch of filenames instead of a single filename.
You need to batch your dataset after map function like this
# Same pipeline, expressed as one chained expression: mapping happens before
# batching, so _parse_function receives scalar filename tensors.
dataset = (
    tf.data.Dataset.from_tensor_slices((tf.constant(image_list), tf.constant(label_list)))
    .shuffle(len(image_list))
    .repeat(epochs)
    .map(_parse_function)
    .batch(batch_size)
)
Related
I am trying to build an image classification pipeline based on this tutorial here, starting at "Using tf.data for finer control".
There are some necessary differences between my implementation and the one in the tutorial:
All my images are in one folder (not separated by class in different folders), in the code this folder is called path_to_imagefiles
The information, about the label is stored in a csv-file, I made a df_metadata in my code below that shows how it is structured if I read the csv-file.
I pass my data as a list of absolute paths to the images (in the code called list_of_imagefile_paths) to tf.data.Dataset.list_files.
There is no way around those changes because I require this workflow for my own data later on.
My problem is that I don't know how to change the code to get the labels from df_metadata instead of the file-path (which is how it is done in the tutorial).
I tried to change the get_label(file_path) function from the tutorial the following way:
def get_label(file_path):
    """Attempt to map an image-path tensor to an integer label via df_metadata.

    NOTE(review): this runs inside Dataset.map, i.e. during graph tracing,
    where tensors are symbolic and have no .numpy() attribute -- that is the
    reported AttributeError.  The pandas lookup below likewise cannot work on
    a symbolic Tensor.
    """
    parts = tf.strings.split(file_path, os.path.sep) #same as in the tutorial
    image_name = parts[-1].numpy() #get the name of the image  # fails: symbolic Tensor in graph mode
    df_image = df_metadata[df_metadata.values == image_name] #find the image name in the dataframe
    one_hot = df_image.iloc[-1, 3] == class_names #df_image.iloc[-1, 3] gets the label
    # Integer encode the label
    return tf.argmax(one_hot)
Now the issue is, that image_name = parts[-1].numpy() doesn't seem to result in 'image1.jpg', which I would need to make it work. I get the following error: 'AttributeError: 'Tensor' object has no attribute 'numpy'
How can I access the information stored in df_metadata to get my labels? Any ideas?
Here is the whole code:
import tensorflow as tf
import pandas as pd
import os
import matplotlib.pyplot as plt
def get_label(file_path):
    """Look up the integer class label for `file_path` in the global df_metadata.

    NOTE(review): inside Dataset.map this executes in graph mode, so
    `file_path` (and therefore parts[-1]) is a symbolic Tensor.  Comparing it
    against df_metadata.values cannot select the intended row; the lookup
    needs to happen in plain Python before the dataset is built, or through a
    tf.lookup table.
    """
    #This is the part where I need help
    parts = tf.strings.split(file_path, os.path.sep)
    image_name = parts[-1]
    df_image = df_metadata[df_metadata.values == image_name]
    one_hot = df_image.iloc[-1, 3] == class_names
    # Integer encode the label
    return tf.argmax(one_hot)
def decode_img(img):
    """Decode a JPEG byte string into a 3-channel tensor resized to IMG_SIZE x IMG_SIZE."""
    decoded = tf.io.decode_jpeg(img, channels=3)
    resized = tf.image.resize(decoded, [IMG_SIZE, IMG_SIZE])
    return resized
def process_path(file_path):
    """Map one image path to an (image, label) training pair."""
    label = get_label(file_path)
    raw_bytes = tf.io.read_file(file_path)  # raw JPEG contents
    return decode_img(raw_bytes), label
#setup paths
path_to_imagefiles = r'P:\somefolder\somesubfolder\all_flower_photos'
#setup parameters
IMG_SIZE = 180
AUTOTUNE = tf.data.AUTOTUNE
batch_size = 64
#creating the df_metadata for demonstration purposes, I actually read them from a csv-file
metadata = [['image1.jpg', 100, 200, 'rose'], ['image2.jpg', 100, 200, 'tulip'],
['image3.jpg', 100, 200, 'sunflower'], ['image4.jpg', 100, 200, 'dandelion'],
['image5.jpg', 100, 200, 'daisy'], ['image6.jpg', 100, 200, 'rose']]
df_metadata = pd.DataFrame(metadata)
# Column 0 holds the image file names; build full paths from them.
list_of_imagefile_names = df_metadata.iloc[:,0].to_list()
list_of_imagefile_paths = []
for image_name in list_of_imagefile_names:
    list_of_imagefile_paths.append(os.path.join(path_to_imagefiles, image_name))
# shuffle=False here, then a single explicit shuffle below so the train/val
# split stays fixed across epochs (reshuffle_each_iteration=False).
list_ds = tf.data.Dataset.list_files(list_of_imagefile_paths, shuffle=False)
list_ds = list_ds.shuffle(len(list_ds), reshuffle_each_iteration=False)
# Column 3 holds the class-label strings.
class_names = df_metadata.iloc[:,3].unique()
# 80/20 train/validation split.
val_size = int(len(list_ds)*0.2)
train_ds = list_ds.skip(val_size)
val_ds = list_ds.take(val_size)
# NOTE(review): these map calls trace process_path/get_label in graph mode,
# where Tensor.numpy() is unavailable -- the source of the reported
# AttributeError.
train_ds = train_ds.map(process_path, num_parallel_calls=AUTOTUNE)
val_ds = val_ds.map(process_path, num_parallel_calls=AUTOTUNE)
I have downloaded the MINC dataset for material classification, which consists of 23 categories. However, I am only interested in a subset of the categories (e.g. [wood, foliage, glass, hair]).
Is it possible to get a subset of the data using tf.keras.preprocessing.image_dataset_from_directory?
I have tried tf.keras.preprocessing.image_dataset_from_directory(folder_dir, label_mode="categorical", class_names=["wood", "foliage", "glass", "hair"]) but it give this error The `class_names` passed did not match the names of the subdirectories of the target directory.
Is there a way to get a subset of the directories without deleting or modifying the folders? I know datagen.flow_from_directory is able to do it but keras says that it is deprecated and I should use image_dataset_from_directory.
There are two ways of doing this. The first is to use a generator, but that process is costly; there is another way, called "Using tf.data for finer control". You can check it out at this link:
https://www.tensorflow.org/tutorials/load_data/images
But, I will show you a brief demo that how you can load only the folders of your choice.
So, let's start...
#First import some libraries which are needed
import os
import glob
import tensorflow as tf
import matplotlib.pyplot as plt
I am taking only two classes of "Cats" vs "Dogs" you can take more than two classes...
batch_size = 32
img_height = 180
img_width = 180
# Define your data directory where your dataset is placed.
data_dir = "path/to/your/dataset_folder"  # TODO: point this at your dataset root
# List only the class sub-folders you want to load; add more names if needed.
dataset_names = ['cats', 'dogs']
# Glob the .jpg images of every requested class folder and flatten the
# results into a single list.
# Fixes over the original: it iterated an undefined name `folders` (should be
# `dataset_names`), concatenated data_dir and the class name without a path
# separator, and only flattened exactly two sub-lists; os.path.join plus a
# flattening comprehension handles any number of classes.
list_files = [p
              for name in dataset_names
              for p in glob.glob(os.path.join(data_dir, name, '*.jpg'))]
image_count = len(list_files)
# Wrap the Python list of file paths in a tf.data pipeline.
list_files = tf.data.Dataset.from_tensor_slices(list_files)
# Class names used later to turn path components into integer labels.
class_names = ['cats', 'dogs']
# Hold out 20% of the images for validation; the rest is training data.
val_size = int(image_count * 0.2)
train_ds = list_files.skip(val_size)
val_ds = list_files.take(val_size)
#To get labels
def get_label(file_path):
# Convert the path to a list of path components
parts = tf.strings.split(file_path, os.path.sep)
parts = tf.strings.substr(parts, -4, 4)[0]
one_hot = parts == class_names
# Integer encode the label
return tf.argmax(one_hot)
def decode_img(img):
    """Decode JPEG bytes to a 3-channel tensor and resize to (img_height, img_width)."""
    tensor = tf.io.decode_jpeg(img, channels=3)
    return tf.image.resize(tensor, [img_height, img_width])
def process_path(file_path):
    """Turn one file path into an (image, label) pair for training."""
    label = get_label(file_path)
    jpeg_bytes = tf.io.read_file(file_path)  # raw file contents as a string tensor
    return decode_img(jpeg_bytes), label
#Use Dataset.map to create a dataset of (image, label) pairs.
# Set `num_parallel_calls` so multiple images are loaded/processed in parallel.
train_ds = train_ds.map(process_path, num_parallel_calls=tf.data.AUTOTUNE)
val_ds = val_ds.map(process_path, num_parallel_calls=tf.data.AUTOTUNE)
#Configure dataset for performance
def configure_for_performance(ds):
    """Cache, shuffle, batch, and prefetch `ds` for efficient training."""
    return (
        ds.cache()
        .shuffle(buffer_size=1000)
        .batch(batch_size)
        .prefetch(buffer_size=tf.data.AUTOTUNE)
    )
train_ds = configure_for_performance(train_ds)
val_ds = configure_for_performance(val_ds)
# Visualize the data: show the first 9 images of the first training batch.
image_batch, label_batch = next(iter(train_ds))
plt.figure(figsize=(10, 10))
for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    # Pixels are float after tf.image.resize; cast back to uint8 for display.
    plt.imshow(image_batch[i].numpy().astype("uint8"))
    label = label_batch[i]
    plt.title(class_names[label])
    plt.axis("off")
Output:
Link to the COLAB file is:
https://colab.research.google.com/drive/1oUNuGVDWDLqwt_YQ80X-CBRL6kJ_YhUX?usp=sharing
I am currently using google colab for one of my Deep learning projects of sign language recognition model, where I am loading my custom dataset which I created from Google Drive. My dataset contains different folder of alphabets which contains sign of respective alphabets.
This is just the part of the code which I am using to create my training data:
training_data = []  # filled in place by create_training_data()

def create_training_data():
    """Populate the global `training_data` list with [image_array, class_index] pairs.

    Reads every image of every alphabet folder under DATADIR in grayscale.
    """
    # enumerate() yields the class index directly, replacing the O(n)
    # CATEGORIES.index(category) lookup the original performed on every
    # outer iteration.
    for class_num, category in enumerate(CATEGORIES):
        path = os.path.join(DATADIR, category)  # path to this alphabet's images
        for img in tqdm(os.listdir(path)):  # iterate over each image
            img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)  # convert to array
            training_data.append([img_array, class_num])  # add this to our training_data

create_training_data()
# Split the accumulated [features, label] pairs into parallel X / y lists.
X = [np.array(features) for features, _ in training_data]
y = [label for _, label in training_data]
But just this process takes up all the available RAM, so is there anything I can do to minimize the RAM usage?
I can't replicate your training data set, so take this with a grain of salt. If you use a generator for the training data instead of a building it up as a list then you should eliminate half of your memory usage. You still pay the memory cost of X and Y, so this technique may not be sufficient to solve your problem.
def iter_training_data():
    """Yield [grayscale_image_array, class_index] pairs lazily, one image at a time.

    Streaming (generator) version of create_training_data: the whole training
    set is never held in memory at once.  enumerate() also replaces the O(n)
    CATEGORIES.index() lookup of the original.
    """
    for class_num, category in enumerate(CATEGORIES):
        path = os.path.join(DATADIR, category)  # path to this alphabet's images
        for img in tqdm(os.listdir(path)):  # iterate over each image
            img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)  # convert to array
            yield [img_array, class_num]
# Materialise the streamed samples into parallel feature/label lists.
# Single pass over the generator, so every image is read from disk only once.
X = []
y = []
for sample, sample_label in iter_training_data():
    X.append(np.array(sample))
    y.append(sample_label)
It takes up all the available RAM as you simply copy all of your data to it.
It might be easier to use DataLoader from PyTorch and define a size of the batch (for not using all the data at once).
import torch
import torchvision
from torchvision import transforms
# Transform pipeline applied to every image as it is loaded.
train_transforms = transforms.Compose([
    # transforms.Resize((256, 256)), # might also help in some way, if resize is allowed in your task
    transforms.ToTensor() ])
train_dir = '/path/to/train/data/'
# ImageFolder expects one sub-folder per class under train_dir and derives
# the labels from the folder names.
train_dataset = torchvision.datasets.ImageFolder(train_dir, train_transforms)
batch_size = 32
# The DataLoader loads images batch by batch, so the full dataset never has
# to sit in RAM at once.
train_dataloader = torch.utils.data.DataLoader( train_dataset, batch_size=batch_size )
Then, on train phase you can do something like:
# ... (training-loop fragment; model/optimizer setup omitted)
for inputs, labels in tqdm(train_dataloader):
    # Move each batch to the training device.
    # NOTE(review): assumes `device` was defined earlier (e.g. torch.device("cuda")).
    inputs = inputs.to(device)
    labels = labels.to(device)
# ...
I am trying to implement the Neural Image Caption generator with visual attention proposed in the "Show, Attend and Tell" paper. For validation, I want to compute the Bleu score. For this, I need to get all the captions for a particular image. I thought of approaching the problem as shown below:
data_path = './processed_data'
split_type = 'val'
# Load the pre-encoded captions and the matching image paths for this split.
encoded_captions = np.load(data_path + '/{}/captions_{}.npy'.format(split_type, split_type))
image_paths = json.load(open(data_path + '/{}/img_paths_{}.json'.format(split_type, split_type), 'r'))
# Shuffle captions and paths together so row i still pairs caption i with path i.
encoded_captions, image_paths = shuffle(encoded_captions, image_paths, random_state=0)
dataset = tf.data.Dataset.from_tensor_slices((image_paths, encoded_captions))
# NOTE(review): map_func is traced in graph mode; its Python list comprehension
# compares the symbolic img_path tensor for equality and uses the result as a
# bool, which raises OperatorNotAllowedInGraphError.  Group captions per image
# in plain Python *before* building the Dataset instead.
dataset = dataset.map(lambda img_path, cap: map_func(img_path, cap, image_paths, encoded_captions))
def map_func(img_path, caption, image_paths, encoded_captions):
    """Load/preprocess one image and gather every caption belonging to it.

    NOTE(review): inside Dataset.map this runs in graph mode, so img_path is a
    symbolic Tensor.  `image_paths[i] == img_path` then produces a boolean
    Tensor, and using it as the `if` condition of the comprehension is what
    raises OperatorNotAllowedInGraphError.  Also note the comprehension reuses
    the name `caption`, shadowing the parameter.  Precompute a
    path -> captions mapping in plain Python before building the dataset, or
    wrap this logic in tf.py_function.
    """
    matching_captions = [caption for i, caption in enumerate(encoded_captions) if image_paths[i] == img_path]
    img = tf.io.read_file(img_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    # Inception-v3 preprocessing (scales pixel values for the pretrained model).
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, caption, matching_captions
When running the above code, I am getting the following error:
OperatorNotAllowedInGraphError: using a tf.Tensor as a Python bool is not allowed in Graph execution. Use Eager execution or decorate this function with #tf.function.
How do I resolve this issue? Alternatively, is there a more efficient way of getting captions that map to a particular image?
Thank you very much!
I am new to tensorflow, and here is my situation: I have lots of folders and each contains several images. I need my training input to be folders(each time 2 folders), and each time 4 images inside a folder be selected for training.
I have tried Dataset api, and tried to use the map or flat_map function, but I failed to read images inside a folder.
Here is part of my codes:
def parse_function(filename):
    """Intended: pick images from the folder named `filename` and return them
    with their labels.

    NOTE(review): when used with Dataset.map, `filename` arrives as a symbolic
    Tensor, so str(filename) is not the folder name and os.listdir /
    random.shuffle cannot operate on it -- this body only works if called
    eagerly with a plain string.
    """
    print(filename)
    batch_data = []
    batch_label = []
    dir_path = os.path.join(data_path, str(filename))
    imgs_list = os.listdir(dir_path)
    random.shuffle(imgs_list)
    # NOTE(review): this REPLICATES the whole list 4 times; it does not
    # select 4 images as the comment suggests.
    imgs_list = imgs_list * 4 #each time select 4 images
    for i in range(img_num):  # NOTE(review): img_num is not defined in this scope -- confirm
        img_path = os.path.join(dir_path, imgs_list[i])
        image_string = tf.read_file(img_path)
        image_decoded = tf.image.decode_image(image_string)
        image_resized = tf.image.resize_images(image_decoded, [224, 224])
        batch_data.append(image_resized)
        batch_label.append(label)  # NOTE(review): label is not defined in this scope either
    return batch_data, batch_label
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
# Fix: the function defined above is named `parse_function`; mapping the
# undefined name `_parse_function` raises NameError.
dataset = dataset.map(parse_function)
where filename is a list of folder name like '123456', labels is list of label like 0 or 1.