I am currently using google colab for one of my Deep learning projects of sign language recognition model, where I am loading my custom dataset which I created from Google Drive. My dataset contains different folder of alphabets which contains sign of respective alphabets.
This is just a part of code which i m using to create my training data
training_data = []
def create_training_data():
for category in CATEGORIES:
path = os.path.join(DATADIR,category) # create path to image of respective alphabet
class_num = CATEGORIES.index(category) # get the classification for each alphabet A : 0, C : 1, D : 2,...
for img in tqdm(os.listdir(path)): # iterate over each image
img_array = cv2.imread(os.path.join(path,img) ,cv2.IMREAD_GRAYSCALE) # convert to array
training_data.append([img_array, class_num]) # add this to our training_data
create_training_data()
X = []
y = []
for features,label in training_data:
X.append(np.array(features))
y.append(label)
But just this process takes up all the available RAM, So is there any way I can do in order to minimize the RAM usage ??
I can't replicate your training data set, so take this with a grain of salt. If you use a generator for the training data instead of a building it up as a list then you should eliminate half of your memory usage. You still pay the memory cost of X and Y, so this technique may not be sufficient to solve your problem.
def iter_training_data():
for category in CATEGORIES:
path = os.path.join(DATADIR,category) # create path to image of respective alphabet
class_num = CATEGORIES.index(category) # get the classification for each alphabet A : 0, C : 1, D : 2,...
for img in tqdm(os.listdir(path)): # iterate over each image
img_array = cv2.imread(os.path.join(path,img) ,cv2.IMREAD_GRAYSCALE) # convert to array
yield [img_array, class_num]
X = []
y = []
for features,label in iter_training_data():
X.append(np.array(features))
y.append(label)
It takes up all the available RAM as you simply copy all of your data to it.
It might be easier to use DataLoader from PyTorch and define a size of the batch (for not using all the data at once).
import torch
import torchvision
from torchvision import transforms
train_transforms = transforms.Compose([
# transforms.Resize((256, 256)), # might also help in some way, if resize is allowed in your task
transforms.ToTensor() ])
train_dir = '/path/to/train/data/'
train_dataset = torchvision.datasets.ImageFolder(train_dir, train_transforms)
batch_size = 32
train_dataloader = torch.utils.data.DataLoader( train_dataset, batch_size=batch_size )
Then, on train phase you can do something like:
# ...
for inputs, labels in tqdm(train_dataloader):
inputs = inputs.to(device)
labels = labels.to(device)
# ...
Related
I have downloaded the MINC dataset for material classification which consists of 23 cateogories. However, I am only interested in a subset of the categories (e.g. [wood, foliage, glass, hair])
Is it possible to get a subset of the data using tf.keras.preprocessing.image_dataset_from_directory?
I have tried tf.keras.preprocessing.image_dataset_from_directory(folder_dir, label_mode="categorical", class_names=["wood", "foliage", "glass", "hair"]) but it give this error The `class_names` passed did not match the names of the subdirectories of the target directory.
Is there a way to get a subset of the directories without deleting or modifying the folders? I know datagen.flow_from_directory is able to do it but keras says that it is deprecated and I should use image_dataset_from_directory.
There are two ways of doing this the first way is to do this by generator, but that process is costly, there is another way of doing this called Using tf.data for finer control. You can check this out at this link
https://www.tensorflow.org/tutorials/load_data/images
But, I will show you a brief demo that how you can load only the folders of your choice.
So, let's start...
#First import some libraries which are needed
import os
import glob
import tensorflow as tf
import matplotlib.pyplot as plt
I am taking only two classes of "Cats" vs "Dogs" you can take more than two classes...
batch_size = 32
img_height = 180
img_width = 180
#define your data directory where your dataset is placed
data_dir = path to your datasetfolder
#Now, here define a list of names for your dataset, like I am only loading cats and dogs... you can fill it with more if you have more
dataset_names = ['cats' , 'dogs']
#Now, glob the list of images in these two directories (cats & Dogs)
list_files = [glob.glob(data_dir + images + '/*.jpg') for images in folders]
list_files = list_files[0] + list_files[1]
image_count = len(list_files)
#Now, here pass this list to a tf.data.Dataset
list_files = tf.data.Dataset.from_tensor_slices(list_files)
#Now, define your class names to labels your dataset later...
class_names = ['cats', 'dogs']
#Now, here define the validation, test, train etc.
val_size = int(image_count * 0.2)
train_ds = list_files.skip(val_size)
val_ds = list_files.take(val_size)
#To get labels
def get_label(file_path):
# Convert the path to a list of path components
parts = tf.strings.split(file_path, os.path.sep)
parts = tf.strings.substr(parts, -4, 4)[0]
one_hot = parts == class_names
# Integer encode the label
return tf.argmax(one_hot)
def decode_img(img):
# Convert the compressed string to a 3D uint8 tensor
img = tf.io.decode_jpeg(img, channels=3)
# Resize the image to the desired size
return tf.image.resize(img, [img_height, img_width])
def process_path(file_path):
label = get_label(file_path)
# Load the raw data from the file as a string
img = tf.io.read_file(file_path)
img = decode_img(img)
return img, label
#Use Dataset.map to create a dataset of image, label pairs:
# Set `num_parallel_calls` so multiple images are loaded/processed in parallel.
train_ds = train_ds.map(process_path, num_parallel_calls=tf.data.AUTOTUNE)
val_ds = val_ds.map(process_path, num_parallel_calls=tf.data.AUTOTUNE)
#Configure dataset for performance
def configure_for_performance(ds):
ds = ds.cache()
ds = ds.shuffle(buffer_size=1000)
ds = ds.batch(batch_size)
ds = ds.prefetch(buffer_size=tf.data.AUTOTUNE)
return ds
train_ds = configure_for_performance(train_ds)
val_ds = configure_for_performance(val_ds)
#Visualize the data
image_batch, label_batch = next(iter(train_ds))
plt.figure(figsize=(10, 10))
for i in range(9):
ax = plt.subplot(3, 3, i + 1)
plt.imshow(image_batch[i].numpy().astype("uint8"))
label = label_batch[i]
plt.title(class_names[label])
plt.axis("off")
Output:
Link to the COLAB file is:
https://colab.research.google.com/drive/1oUNuGVDWDLqwt_YQ80X-CBRL6kJ_YhUX?usp=sharing
I'm following this tutorial: How do I load train and test data from the local drive for a deep learning Keras model? and it went like this
name 'train_data' is not defined
I know I haven't defined train_data yet, but I don't know what to write inside train_data = ...
My code is look like this
train_path = '/Users/nayageovani/Documents/Artificial Intelligence/dataset/train'
train_batch = os.listdir(train_path)
x_train = []
# if data are in form of images
for sample in train_data:
img_path = train_path+sample
x = image.load_img(img_path)
# preprocessing if required
x_train.append(x)
test_path = PATH+'/data/test/'
test_batch = os.listdir(test_path)
x_test = []
heres my folder of dataset looks like
|--dataset
|--test
|--fresh
|--rotten
|--train
|--fresh
|--rotten
train_data (and test_data ) should be iterables that contain the file names of your training or test data, respectively.
You could, for example, create a list of files in the training data directory like:
import os
...
imgTypes = ['jpg', 'png', 'gif', 'bmp']
train_data = [item for item in os.listdir(train_path) if \
(os.path.isfile(os.path.join(train_path, item)) and
os.path.splitext(item)[1].lower() in imgTypes)]
Update:
A better alternative for loading the image data is using keras' ImageDataGenerator class. Among other things, it directly allows you to preprocess your data while loading.
Problem
I was following a Tensorflow 2 tutorial on how to load images with pure Tensorflow, because it is supposed to be faster than with Keras. The tutorial ends before showing how to split the resulting dataset (~tf.Dataset) into a train and validation dataset.
I checked the reference for tf.Dataset and it does not contain a split() method.
I tried slicing it manually but tf.Dataset neither contains a size() nor a length() method, so I don't see how I could slice it myself.
I can't use the validation_split argument of Model.fit() because I need to augment the training dataset but not the validation dataset.
Question
What is the intended way to split a tf.Dataset or should I use a different workflow where I won't have to do this?
Example Code
(from the tutorial)
BATCH_SIZE = 32
IMG_HEIGHT = 224
IMG_WIDTH = 224
list_ds = tf.data.Dataset.list_files(str(data_dir/'*/*'))
def get_label(file_path):
# convert the path to a list of path components
parts = tf.strings.split(file_path, os.path.sep)
# The second to last is the class-directory
return parts[-2] == CLASS_NAMES
def decode_img(img):
# convert the compressed string to a 3D uint8 tensor
img = tf.image.decode_jpeg(img, channels=3)
# Use `convert_image_dtype` to convert to floats in the [0,1] range.
img = tf.image.convert_image_dtype(img, tf.float32)
# resize the image to the desired size.
return tf.image.resize(img, [IMG_WIDTH, IMG_HEIGHT])
def process_path(file_path):
label = get_label(file_path)
# load the raw data from the file as a string
img = tf.io.read_file(file_path)
img = decode_img(img)
return img, label
labeled_ds = list_ds.map(process_path, num_parallel_calls=AUTOTUNE)
#...
#...
I can either split list_ds (list of files) or labeled_ds (list of images and labels), but how?
I don't think there's a canonical way (typically, data is being split e.g. in separate directories). But here's a recipe that will let you do it dynamically:
# Caveat: cache list_ds, otherwise it will perform the directory listing twice.
ds = list_ds.cache()
# Add some indices.
ds = ds.enumerate()
# Do a rougly 70-30 split.
train_list_ds = ds.filter(lambda i, data: i % 10 < 7)
test_list_ds = ds.filter(lambda i, data: i % 10 >= 7)
# Drop indices.
train_list_ds = train_list_ds.map(lambda i, data: data)
test_list_ds = test_list_ds.map(lambda i, data: data)
Based on Dan Moldovan's answer I created a reusable function. Maybe this is useful to other people.
def split_dataset(dataset: tf.data.Dataset, validation_data_fraction: float):
"""
Splits a dataset of type tf.data.Dataset into a training and validation dataset using given ratio. Fractions are
rounded up to two decimal places.
#param dataset: the input dataset to split.
#param validation_data_fraction: the fraction of the validation data as a float between 0 and 1.
#return: a tuple of two tf.data.Datasets as (training, validation)
"""
validation_data_percent = round(validation_data_fraction * 100)
if not (0 <= validation_data_percent <= 100):
raise ValueError("validation data fraction must be ∈ [0,1]")
dataset = dataset.enumerate()
train_dataset = dataset.filter(lambda f, data: f % 100 > validation_data_percent)
validation_dataset = dataset.filter(lambda f, data: f % 100 <= validation_data_percent)
# remove enumeration
train_dataset = train_dataset.map(lambda f, data: data)
validation_dataset = validation_dataset.map(lambda f, data: data)
return train_dataset, validation_dataset
I have a very huge database of images locally, with the data distribution like each folder cointains the images of one class.
I would like to use the tensorflow dataset API to obtain batches de data without having all the images loaded in memory.
I have tried something like this:
def _parse_function(filename, label):
image_string = tf.read_file(filename, "file_reader")
image_decoded = tf.image.decode_jpeg(image_string, channels=3)
image = tf.cast(image_decoded, tf.float32)
return image, label
image_list, label_list, label_map_dict = read_data()
dataset = tf.data.Dataset.from_tensor_slices((tf.constant(image_list), tf.constant(label_list)))
dataset = dataset.shuffle(len(image_list))
dataset = dataset.repeat(epochs).batch(batch_size)
dataset = dataset.map(_parse_function)
iterator = dataset.make_one_shot_iterator()
image_list is a list where the path (and name) of the images have been appended and label_list is a list where the class of each image has been appended in the same order.
But the _parse_function does not work, the error that I recibe is:
ValueError: Shape must be rank 0 but is rank 1 for 'file_reader' (op: 'ReadFile') with input shapes: [?].
I have googled the error, but nothing works for me.
If I do not use the map function, I just recibe the path of the images (which are store in image_list), so I think that I need the map function to read the images, but I am not able to make it works.
Thank you in advance.
EDIT:
def read_data():
image_list = []
label_list = []
label_map_dict = {}
count_label = 0
for class_name in os.listdir(base_path):
class_path = os.path.join(base_path, class_name)
label_map_dict[class_name]=count_label
for image_name in os.listdir(class_path):
image_path = os.path.join(class_path, image_name)
label_list.append(count_label)
image_list.append(image_path)
count_label += 1
The error is in this line dataset = dataset.repeat(epochs).batch(batch_size) Your pipeline adds batchsize as a dimension to input.
You need to batch your dataset after map function like this
dataset = tf.data.Dataset.from_tensor_slices((tf.constant(image_list), tf.constant(label_list)))
dataset = dataset.shuffle(len(image_list))
dataset = dataset.repeat(epochs)
dataset = dataset.map(_parse_function).batch(batch_size)
We can generate image dataset using ImageDataGenerator with flow_from_directory method. For calling list of class, we can use oject.classes. But, how to call list of values? I've searched and still not found any.
Thanks :)
The ImageDataGenerator is a python generator, it would yield a batch of data with the shape same with your model inputs(like(batch_size,width,height,channels)) each time. The benefit of the generator is when your data set is too big, you can't put all the data to your limited memory, but, with the generator you can generate one batch data each time. and the ImageDataGenerator works with model.fit_generator(), model.predict_generator().
If you want to get the numeric data, you can use the next() function of the generator:
import numpy as np
data_gen = ImageDataGenerator(rescale = 1. / 255)
data_generator = datagen.flow_from_directory(
data_dir,
target_size=(img_height, img_width),
batch_size=batch_size,
class_mode='categorical')
data_list = []
batch_index = 0
while batch_index <= data_generator.batch_index:
data = data_generator.next()
data_list.append(data[0])
batch_index = batch_index + 1
# now, data_array is the numeric data of whole images
data_array = np.asarray(data_list)
Alternatively, you can use PIL and numpy process the image by yourself:
from PIL import Image
import numpy as np
def image_to_array(file_path):
img = Image.open(file_path)
img = img.resize((img_width,img_height))
data = np.asarray(img,dtype='float32')
return data
# now data is a tensor with shape(width,height,channels) of a single image.
Then, you can loop all your images with this function to get the numeric data.
Notice, I recommend you to use generator instead of get all the data directly, or, you might run out of memory.
'But, how to call list of values' - If I understood correctly, I guess you wish to know what all files are there in your data set - if that's correct, (or if not), there are various ways you can get values from your generator:
use object.filenames.
Object.filenames returns the list of all files in your target folder. I just use the len(object.filename) function to get the total number of files in my test folder. Then pass that number back into my generator and run it again.
generator.n
Other way to get number of all items in your test folder is generator.n
x , y = test_generator.next() to load my array and classes ( if inferred).
Or a = test_generator.next(), where your array and classes will be returned as tuple.
I only used this as my test data set was really small ( 60 images) and I was using extracted features to train and predict my model( that is feature array and not the image array).
If you are building a normal model, using generator to yield batches is much better way.
Create a function using generator
def generate_test_data_from_directory(folder_path, image_target_size = 224, batch_size = 5, channels = 3, class_mode = 'sparse' ):
'''fetch all out test data from directory'''
test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_directory(
folder_path ,
target_size = (image_target_size, image_target_size),
batch_size = batch_size,
class_mode = class_mode)
total_images = test_generator.n
steps = total_images//batch_size
#iterations to cover all data, so if batch is 5, it will take total_images/5 iteration
x , y = [] , []
for i in range(steps):
a , b = test_generator.next()
x.extend(a)
y.extend(b)
return np.array(x), np.array(y)