How do I get a single random example from a PyTorch DataLoader?
If my DataLoader gives minibatches of multiple images and labels, how do I get a single random image and label?
Note that I don't want a single image and label per minibatch, I want a total of one example.
If your DataLoader is something like this:
test_loader = DataLoader(image_datasets['val'], batch_size=batch_size, shuffle=True)
it is giving you a batch of size batch_size, and you can pick out a single random example by directly indexing the batch:
for test_images, test_labels in test_loader:
    sample_image = test_images[0]  # reshape according to your needs
    sample_label = test_labels[0]
    break  # one batch is enough for a single example
Alternative solutions

1. Use a RandomSampler to obtain random samples (see the sketch after this list).
2. Use a batch_size of 1 in your DataLoader.
3. Take samples directly from your Dataset like so:
mnist_test = datasets.MNIST('../MNIST/', train=False, transform=transform)
Now use this dataset to take samples:
for image, label in mnist_test:
    # do something with image and label
4. (Probably the best) Grab a single batch straight from the loader:
inputs, classes = next(iter(dataloader))
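A minimal sketch of alternatives 1 and 2 combined, assuming image_datasets['val'] from the question's setup: a RandomSampler plus batch_size=1 yields exactly one random example per iteration.
from torch.utils.data import DataLoader, RandomSampler

# A RandomSampler draws dataset indices in random order; with
# batch_size=1, each iteration yields a single random example.
sampler = RandomSampler(image_datasets['val'])
loader = DataLoader(image_datasets['val'], batch_size=1, sampler=sampler)
sample_image, sample_label = next(iter(loader))  # each has a leading batch dim of 1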
If you want to choose specific images from your Trainloader/Testloader, you should check out the Subset class from torch.utils.data:
Here's an example of how to use it:
testset = ImageFolderWithPaths(root="path/to/your/Image_Data/Test/", transform=transform)
subset_indices = [0] # select your indices here as a list
subset = torch.utils.data.Subset(testset, subset_indices)
testloader_subset = torch.utils.data.DataLoader(subset, batch_size=1, num_workers=0, shuffle=False)
This way you can use exactly one image and label. However, you can of course use more than just one index in your subset_indices.
If you want to use a specific image from your ImageFolder, you can use the dataset's samples attribute (a list of (path, class) pairs) and build a dictionary to look up the index of the image you want, as sketched below.
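A minimal sketch of that idea, assuming an ImageFolder-style dataset (the attribute is samples, plural; the file path below is hypothetical):
# Map file paths to dataset indices so a specific image can be fetched
# by name. ImageFolder stores (path, class_index) pairs in `samples`.
path_to_index = {path: i for i, (path, _) in enumerate(dataset.samples)}
idx = path_to_index["path/to/your/Image_Data/Test/class_a/img_0.png"]  # hypothetical path
image, label = dataset[idx]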
(This answer supplements Alternative 3 of @parthagar's answer.)
Iterating through the dataset does not return "random" examples; you should instead use:
import numpy as np

# Recover the original `dataset` from the `dataloader`
dataset = dataloader.dataset
n_samples = len(dataset)

# Get a random sample
random_index = np.random.randint(n_samples)
single_example = dataset[random_index]
TL;DR:
The general form to get a single example from a DataLoader is:
sample = [x[0] for x in next(iter(trainloader))]
In particular, for the question asked, where minibatches of images and labels are returned:
image, label = [x[0] for x in next(iter(trainloader))]
Possibly interesting information:
To get a single minibatch from the DataLoader, use:
next(iter(trainloader))
When running something like for images, labels in dataloader:, what happens under the hood is that an iterator is created via iter(dataloader), and then the iterator's __next__() is called (via next()) on each loop pass.
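In other words, the loop is roughly equivalent to:
it = iter(dataloader)      # calls dataloader.__iter__()
images, labels = next(it)  # calls it.__next__(); repeated once per loop pass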
To get a single image from a DataLoader that returns images and labels, use:
image = next(iter(trainloader))[0][0]
This is the same as doing:
images, labels = next(iter(trainloader))
image = images[0]
Random sample from DataLoader
Assuming DataLoader(shuffle=True) was used in its construction, a single random example can be drawn from the DataLoader with:
example = next(iter(dataloader))[0]
Random sample from Dataset
If that is not the case, you can draw a single random example from the Dataset with:
idx = torch.randint(len(dataset), (1,)).item()  # .item() converts the 1-element tensor to a plain int index
example = dataset[idx]
The key to getting a random sample is to set shuffle=True for the DataLoader, and the key to getting a single image is to set the batch size to 1.
Here is an example using the MNIST dataset (x_train and y_train are assumed to be pre-loaded tensors).
from torch.utils.data import DataLoader, TensorDataset
from matplotlib import pyplot as plt

bs = 1
train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)

for xb, yb in train_dl:
    print(xb.shape)
    x = xb.view(28, 28)
    print(x.shape)
    print(yb)
    break  # just once

plt.imshow(x, cmap="gray")
Related
When I append my labels, I end up with 20580 as the length of y, but what I'm hoping for is 120, which is the number of categories. How can I append the categories to my labels?
import numpy as np
import matplotlib.pyplot as plt
import os
import cv2
import random as rand
import time
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Activation, Dropout, Flatten, MaxPooling2D
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.optimizers import Adam
config = tf.compat.v1.ConfigProto(gpu_options=tf.compat.v1.GPUOptions(allow_growth=True))
sess = tf.compat.v1.Session(config=config)
DATADIR = "C:/Users/samue/Documents/Datasets/DogBreeds/images/Images"
CATEGORIES = os.listdir("C:/Users/samue/Documents/Datasets/DogBreeds/images/Images")
IMG_SIZE = 100
training_data = []
def create_training_data():
    for category in CATEGORIES:
        path = os.path.join(DATADIR, category)
        class_num = CATEGORIES.index(category)
        for img in os.listdir(path):
            try:
                img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)
                new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
                training_data.append([new_array, class_num])
            except Exception as e:
                pass
create_training_data()
rand.shuffle(training_data)
X = []
y = []
for features, label in training_data:
    X.append(features)
    y.append(label)
X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
y = np.array(y).reshape(-1,)
print(len(CATEGORIES))
print(len(X))
print(len(y))
The outputs I get at the end are:
120
20580
20580
I think you should step back a little from the implementation details, or even from this specific problem, to understand what is going on. In image classification, the objective is to classify the input, a 2D tensor (or 3D tensor if it's a multichannel image), by assigning it to a label. The number of labels is finite; you can only classify into a certain number of classes.
To give an example, let's take the MNIST database, a well-known dataset used for image-classification tasks. Its training set contains 60,000 1x28x28 images representing handwritten digits. Generally speaking, the goal with this dataset is to classify each image correctly into one of 10 labels, corresponding to the digits "0" through "9". So the question in this particular case is: given image X, the model needs to predict a class for it, either "0", "1", ..., or "9"; there are only 10 possibilities. In supervised learning, we use labels to train the model. For any given input, we need to know the ground truth, i.e. the real class this input belongs to. So you end up with as many labels as there are inputs, because each input is assigned its own label, regardless of the number of unique possible labels.
In your use case, it seems you are working with a total of 120 classes and 20,580 images. That's 20,580 unique data inputs. Remember, we need a corresponding ground truth for each one of those images: the real class each image belongs to. So naturally you end up with a total of 20,580 labels as well.
This might have been the source of your confusion: in my terms, a label is different from a class. A class set is a unique set of entities (animals, digits, ...), while a label refers to the particular class assigned to a particular input.
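A tiny illustration of the distinction:
import numpy as np

# Hypothetical labels for six images drawn from three classes
y = np.array([0, 2, 1, 1, 0, 2])
print(len(y))             # 6 -> one label per image (your 20580)
print(len(np.unique(y)))  # 3 -> number of distinct classes (your 120)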
I think you are a bit confused. You should have a dataset consisting of 120 classes.
For each of those classes you need images characteristic of that class. For example, assume you are building a classifier to distinguish between images of dogs and images of cats, so you have 2 classes. You can structure your directories as follows:
source_dir
----------cats_dir
------------------cats first image
------------------cats second image
------------------cats nth image
----------dogs_dir
------------------dogs first image
------------------dogs second image
------------------dogs mth image
For your case you will have 120 subdirectories (class directories) below the source_dir, and each of these should contain the images associated with that class. In your case it appears that you have a total of 20580 images; if they are evenly distributed, you have roughly 171 images per class.
Now you want to use these images to train a CNN. You can do it the way you were proceeding, but I recommend against it, because you would end up loading all 20580 100 x 100 images into memory at once. That takes a lot of memory, and you are likely to get an OOM (out-of-memory) error. The way to solve that is to feed the data to your model in batches, for example 32 images at a time.
Keras has useful functions to assist with this. If you have the directory structure shown above, you can use ImageDataGenerator.flow_from_directory to feed your images to the model in batches (see the Keras documentation for flow_from_directory). This function also enables image augmentation to help expand the diversity of your dataset. Below is the code I recommend for the dog/cat classification example mentioned above.
source_dir = r'c:\temp\cats_and_dogs'
v_split = .2  # percentage of the data to allocate to the validation set
data_gen = ImageDataGenerator(rescale=1/255, validation_split=v_split)
train_gen = data_gen.flow_from_directory(source_dir, target_size=(100, 100),
                                         class_mode='categorical', batch_size=32,
                                         subset='training', color_mode='grayscale')
valid_gen = data_gen.flow_from_directory(source_dir, target_size=(100, 100),
                                         class_mode='categorical', batch_size=32,
                                         subset='validation', color_mode='grayscale',
                                         shuffle=False)
When you compile your model, set loss='categorical_crossentropy'.
You can then use the two generators above as inputs to model.fit, for example:
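A minimal sketch, assuming model is the CNN you built; the optimizer and epoch count are arbitrary choices, not requirements:
model.compile(optimizer=Adam(), loss='categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(train_gen, validation_data=valid_gen, epochs=10)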
I need to train a model on a dataset that requires more memory than my GPU has. What is the best practice for feeding the dataset to the model?
Here are my steps:
First of all, I load the dataset using a batch_size:
BATCH_SIZE=32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
In the second step, I prepare the data:
for record in raw_train_ds.take(1):
    train_images, train_labels = record['image'], record['label']
    print(train_images.shape)
    train_images = train_images.numpy().astype(np.float32) / 255.0
    train_labels = tf.keras.utils.to_categorical(train_labels)
And then I feed the data to the model:
history = model.fit(train_images,train_labels, epochs=NUM_EPOCHS, validation_split=0.2)
But at step 2 I prepared data only for the first batch and missed the rest of the batches, because model.fit is outside the loop scope (and, as I understand it, works on that first batch only).
On the other hand, I can't remove take(1) and move model.fit inside the loop: yes, in that case I would handle all batches, but then model.fit would be called at the end of each iteration, which also would not work properly.
So, how should I change my code to work appropriately with a big dataset using model.fit? Could you point me to an article or any documentation, or just advise me on how to deal with this? Thanks.
Update
In my post below (Approach 1) I describe one way to solve the problem. Are there any other, better approaches, or is that the only one?
You can pass the whole dataset to fit for training. As you can see in the documentation, one of the possible values of the first parameter is:
A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
So you just need to convert your dataset to that format (a tuple with input and target) and pass it to fit:
BATCH_SIZE=32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
raw_train_ds = datasets['train']
train_dataset_fit = raw_train_ds.map(
    lambda x: (tf.cast(x['image'], tf.float32) / 255.0, x['label']))
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)
One problem with this is that it does not support a validation_split parameter but, as shown in this guide, tfds already gives you the functionality to split the data. So you would just need to get the test split dataset, transform it as above, and pass it as validation_data to fit, as sketched below.
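A minimal sketch of that, assuming the datasets dict built above (tfds's MNIST also provides a 'test' split):
raw_test_ds = datasets['test']
test_dataset_fit = raw_test_ds.map(
    lambda x: (tf.cast(x['image'], tf.float32) / 255.0, x['label']))
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS,
                    validation_data=test_dataset_fit)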
Approach 1
Thanks to @jdehesa, I changed my code:
Load the dataset. In reality, it doesn't load data into memory until the first 'next' call on the dataset iterator, and even then I think the iterator loads a portion of the data (a batch) of size BATCH_SIZE:
raw_train_ds, raw_validation_ds = builder.as_dataset(split=["train[:90%]", "train[90%:]"], batch_size=BATCH_SIZE)
I collected all the required transformations into one method:
def prepare_data(x):
    train_images, train_labels = x['image'], x['label']
    # TODO: resize image
    train_images = tf.cast(train_images, tf.float32) / 255.0
    # train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=NUM_CLASSES)
    train_labels = tf.one_hot(train_labels, NUM_CLASSES)
    return (train_images, train_labels)
I applied these transformations to each element in the batched dataset using the tf.data.Dataset.map method:
train_dataset_fit = raw_train_ds.map(prepare_data)
And then I fed this dataset into model.fit; as I understand it, model.fit will iterate through all the batches in the dataset.
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)
I am working on the Python/TensorFlow/MNIST tutorial.
For a few weeks now, using the original code from the TensorFlow website, I have been getting a warning that the image dataset will soon be deprecated and that I should use the following one:
https://github.com/tensorflow/models/blob/master/official/mnist/dataset.py
I load it in my code using:
from tensorflow.models.official.mnist import dataset
trainfile = dataset.train(data_dir)
This returns:
tf.data.Dataset.zip((images, labels))
The issue is that I cannot find a way to separate them, for example in the following way:
trainfile = dataset.train(data_dir)
train_data= trainfile.images
train_label= trainfile.label
But this clearly does not work, because the attributes images and label do not exist; trainfile is a tf.data.Dataset.
Knowing that the dataset is made of int32 and float32, I tried:
train_data = trainfile.map(lambda x,y : x.dtype == tf.float32)
But it returns an empty dataset.
I insist (but will be open-minded) on doing it this way (two complete batches of images and labels), because this is how the tutorial works:
https://www.tensorflow.org/tutorials/estimators/cnn
I have seen a lot of solutions for getting elements from datasets, but nothing on undoing the zip operation that is done in the following code:
tf.data.Dataset.zip((images, labels))
Thank you in advance for your help.
I hope this helps:
inputs = tf.placeholder(tf.float32, shape=(None, 784), name='inputs')
outputs = tf.placeholder(tf.float32, shape=(None,), name='outputs')

# Prepare a TensorFlow dataset
ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
ds = ds.shuffle(buffer_size=10, reshuffle_each_iteration=True).batch(batch_size=batch_size, drop_remainder=True).repeat()

iterator = ds.make_one_shot_iterator()  # renamed from `iter` to avoid shadowing the builtin
next_batch = iterator.get_next()
inputs = next_batch[0]
outputs = next_batch[1]
TensorFlow's get_single_element() is finally available, and it can be used to unzip features and labels from the dataset.
This avoids the need to generate and use an iterator via .map(), iter(), or one_shot_iterator() (which can be costly for big datasets).
get_single_element() returns a tensor (or a tuple or dict of tensors) encapsulating all the members of the dataset. We need to pass all the members of the dataset batched into a single element.
This can be used to get features as a tensor-array, or features and labels as a tuple or dictionary (of tensor-arrays) depending upon how the original dataset was created.
Check this answer on SO for an example that unpacks features and labels into a tuple of tensor-arrays.
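A minimal sketch of the idea, assuming the zipped dataset ds and its size dataset_size from context, and that one all-encompassing batch fits in memory:
batched = ds.batch(dataset_size)  # a single batch holding every example
features, labels = tf.data.experimental.get_single_element(batched)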
Instead of separating into two datasets, one for images and another for labels, it's best to make a single iterator which returns both the image and the label.
The reason why this is preferred is that it's a lot easier to ensure that you match each example with its label even after a complicated series of shuffles, reorderings, filterings, etc, as you might have in a nontrivial input pipeline.
You can visualize the images and find their associated labels:
ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
ds = ds.shuffle(buffer_size=10).batch(batch_size=batch_size)
iterator = ds.make_one_shot_iterator()  # renamed from `iter`/`next` to avoid shadowing builtins
next_batch = iterator.get_next()

def display(image, label):
    # display image
    ...
    plt.imshow(image)
    ...

with tf.Session() as sess:
    try:
        while True:
            image, label = sess.run(next_batch)
            # image = numpy array (batch, image_size)
            # label = numpy array (batch, label)
            display(image[0], label[0])  # display the first image in the batch
    except tf.errors.OutOfRangeError:  # raised when the dataset is exhausted
        pass
test_batches = ImageDataGenerator(
    preprocessing_function=preprocess_input
).flow_from_directory(test_path, target_size=(224, 224), batch_size=1, class_mode=None, shuffle="false")
prediction = model.predict_generator(test_batches, steps=1, verbose=1)
np.argmax(prediction)
So here I am testing one image by using batch_size=1 and steps=1. Whenever I run this I get different predictions, which means it's not picking the same image every time. How can I check the image name?
EDIT: Here's another attempt to explain the problem I am facing:
test_batches = ImageDataGenerator(
    preprocessing_function=preprocess_input
).flow_from_directory(test_path, target_size=(224, 224), batch_size=2, class_mode=None, shuffle="false")
prediction = model.predict_generator(test_batches, steps=1, verbose=2)
The prediction variable then holds two arrays of prediction probabilities. How can I know which images these predictions are for?
If your goal is to get acquainted with Keras ImageDataGenerators:
If you want your generator to always return the same image (for reproducibility):
from keras.preprocessing.image import ImageDataGenerator
import numpy as np
data_dir = 'path/to/image/directory' # path to the directory where the images are stored
index = 0 # select a number here
ig = ImageDataGenerator()
gen = ig.flow_from_directory(data_dir, batch_size=1) # if you want batch_size > 1 you need to
# add as many indices as your batch_size.
image, label = gen._get_batches_of_transformed_samples(np.array([index]))
image_name = gen.filenames[index]
# do whatever you want with your image and label
If you want your generator to always return a random image but know which one it is I would suggest doing the following:
index = next(gen.index_generator)
image, label = gen._get_batches_of_transformed_samples(index)
image_name = gen.filenames[index]
If you want to see how predict_generator works, however, none of these approaches will help you out. The only thing I can think of is editing the DirectoryIterator code.
For example you could add a line that prints the name of the image you are passing. I would suggest adding the following statement after line 1434:
print(fname)
You can use the generator's filenames attribute:
image_name = test_batches.filenames[0]
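One detail worth noting: shuffle expects a boolean, and the string "false" is truthy, so shuffle="false" actually enables shuffling, which alone would explain getting a different image each run. With shuffle=False, the prediction order matches the filenames order, so the two can be zipped (a sketch):
import numpy as np

# With shuffle=False, predict_generator returns rows in the same order
# as test_batches.filenames, so predictions can be paired with files.
predictions = model.predict_generator(test_batches, steps=len(test_batches), verbose=1)
for fname, probs in zip(test_batches.filenames, predictions):
    print(fname, np.argmax(probs))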
I am using the getting-started example of the TensorFlow CNN tutorial and updating its parameters for my own data, but since my model is large (244 * 244 features) I get an OutOfMemory error.
I am running the training on Ubuntu 14.04 with 4 CPUs and 16 GB of RAM.
Is there a way to shrink my data so I don't get this OOM error?
My code looks like this:
# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, model_dir="path/to/model")

# Load the data
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": np.array(training_set.data)},
    y=np.array(training_set.target),
    num_epochs=None,
    batch_size=5,
    shuffle=True)

# Train the model
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=100,
    hooks=[logging_hook])
Is there a way to shrink my data so I don't get this OOM error?
You can slice your training_set to obtain just a portion of the dataset. Something like:
x={"x": np.array(training_set.data)[:(len(training_set)/2)]},
y=np.array(training_set.target)[:(len(training_set)/2)],
In this example you are getting the first half of your dataset (you can select up to what point of your dataset you want to load).
Edit: another way you can do this is to take a random subset of your training dataset, which you can achieve by masking elements of your dataset array. For example:
import numpy as np
from random import random as rn

# Obtain a boolean mask to filter out some elements;
# here you can define your sample percentage
r = 0.5  # say, filter out half the elements
mask = [rn() >= r for _ in range(len(training_set))]

# Finally, mask out those elements; the result will keep
# roughly (1 - r) times the original number of elements
reduced_ds = training_set[mask]