Print random sample from dataloader in PyTorch - python

I have a certain dataset loaded into a dataloader. For example, if I wanted to save 100 images from this dataloader, how should I iterate over the dataloader to save them?

I'm not exactly sure what you are trying to do (maybe edit your question), but maybe this helps:
dataset = Dataset()
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=32,
    num_workers=1,
    shuffle=True)
for samples, targets in dataloader:
    # 'samples' now is a batch of 32 (see batch_size above) elements of your dataset
    pass  # process the batch here
Is this what you wanted? Hope so :)
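Since the question mentions saving 100 images, here is a minimal sketch of how that could look, assuming the samples are image tensors and torchvision is installed (the filename pattern is just illustrative):
import torchvision

saved = 0
for samples, targets in dataloader:
    for img in samples:
        # write each image tensor to disk as a PNG; filename pattern is hypothetical
        torchvision.utils.save_image(img, f"sample_{saved}.png")
        saved += 1
        if saved == 100:
            break
    if saved == 100:
        break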

Take dl as your dataloader.
If you want to print only the first batch you can do this:
for data, targets in dl:
    print(data, targets)  # or whatever you want
    break
But if you want the n-th batch you can do this:
for batch_idx, (data, targets) in enumerate(dl):
    if batch_idx == n:
        print(data, targets)  # or whatever you want
        break

Related

What does a PyTorch DataLoader give in each iteration when we use a map-style dataset?

train_loader = DataLoader(train_dataset, batch_size=3, shuffle=True)
for batch in train_loader:
    model.train()
    x, y = batch
    pred = model(x)
I have used a map-style dataset with my DataLoader.
As far as I understand, each iteration should give me (ind, data) from the dataloader.
Why, then, are we predicting on x (ind) in my code?
The DataLoader stacks elements of your torch tensor dataset into a batch, though exactly what it returns depends on your dataset. How many elements it stacks is determined by your batch size, which in your example is 3. So x and y in your loop are batched inputs and labels, not an index and data.
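To make the stacking behaviour concrete, here is a minimal sketch with a toy TensorDataset (the shapes are made up just for illustration); with batch_size=3, each iteration yields feature and label tensors stacked along a new first dimension of size 3:
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy map-style dataset: 10 samples, 5 features each
features = torch.randn(10, 5)
labels = torch.randint(0, 2, (10,))
train_dataset = TensorDataset(features, labels)

train_loader = DataLoader(train_dataset, batch_size=3, shuffle=True)
for batch in train_loader:
    x, y = batch                 # x and y are stacked batches, not (index, data)
    print(x.shape, y.shape)      # torch.Size([3, 5]) torch.Size([3])
    break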

Training a network for machine learning purposes, dividing the dataset into portions

I have a big dataset that can't be loaded into RAM because there is not enough memory.
What I am trying to do is train the model on x portions of the dataset, so that the final model ends up trained on the whole dataset, as follows:
num_divisione_dataset = 4
div_tr = len(x_tr) // num_divisione_dataset
div_val = 2160 // num_divisione_dataset
num_training = int(math.ceil(100 / num_divisione_dataset))
for i in range(0, num_divisione_dataset - 1):
    model.fit(
        x_tr[div_tr*i:div_tr*(i+1)], y_tr[div_tr*i:div_tr*(i+1)],
        batch_size=32,
        callbacks=[model_checkpoint_callback],
        validation_data=(x_val, y_val),
        epochs=25
    )
Is this a right way to train a model?
The batch_size=32 already is a way to train the model in batches of size 32. It seems you have two levels of batching: one that you built yourself and another that's provided by TensorFlow.
The problem with your batching is epochs=25. The TensorFlow batches alternate within an epoch, and in the next epoch it loops over the TensorFlow batches again. But you first train 25 epochs on your first portion, then 25 epochs on your second portion, etcetera.
I'm not sure this is best solved in software. It might be easier to just ignore the lack of RAM and let the OS swap to disk. Buying more RAM could be another viable route. But a possible software route would be an input pipeline:
Put your data in a CSV file. Then use make_csv_dataset to load it in batches and pass it to model.fit. Make sure to set num_epochs=1, otherwise the dataset will loop forever.
Here you can find an example of how to use it.
Minimal code would look like this:
DATASET_PATH = ...   # path of the csv file
LABEL_COLUMN = ...   # name of the column in the csv file representing the output
COLUMNS = ["a", "b", "c", "d"]  # names of the columns in the csv file representing the input
BATCH_SIZE = int(len(x_tr) / num_divisione_dataset)

def get_dataset(batch_size=5):
    return tf.data.experimental.make_csv_dataset(
        DATASET_PATH, batch_size=batch_size,
        label_name=LABEL_COLUMN, num_epochs=1)

dataset = get_dataset(batch_size=BATCH_SIZE)
train_size = ...  # put the train dataset size here
train_dataset = dataset.take(train_size)
val_dataset = dataset.skip(train_size)

columns = []
for c in COLUMNS:
    cln = tf.feature_column.numeric_column(c, shape=())
    columns.append(cln)
feature_layer = tf.keras.layers.DenseFeatures(columns)

model = Sequential()
model.add(feature_layer)
# model.add(...)      add your NN layers here
# model.compile(...)  with your compile parameters
history = model.fit(
    train_dataset,
    validation_data=val_dataset,
    callbacks=[model_checkpoint_callback],
    epochs=25,
)

What are the best practices to train a model on a BIG dataset?

I need to train a model on a dataset that requires more memory than my GPU has. What is the best practice for feeding the dataset to the model?
Here are my steps:
First of all, I load the dataset with a batch size:
BATCH_SIZE = 32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
In the second step I prepare the data:
for record in raw_train_ds.take(1):
    train_images, train_labels = record['image'], record['label']
    print(train_images.shape)
    train_images = train_images.numpy().astype(np.float32) / 255.0
    train_labels = tf.keras.utils.to_categorical(train_labels)
And then I feed the data to the model:
history = model.fit(train_images, train_labels, epochs=NUM_EPOCHS, validation_split=0.2)
But at step 2 I prepared the data for the first batch only and missed the rest of the batches, because model.fit is outside the loop scope (and, as I understand, works on that one first batch only).
On the other hand, I can't remove take(1) and move model.fit inside the loop. Because yes, in that case I would handle all batches, but at the same time model.fit would be called at the end of each iteration, and in that case it also would not work properly.
So, how should I change my code to work appropriately with a big dataset using model.fit? Could you point me to an article, any documents, or just advise how to deal with it? Thanks.
Update
In my post below (Approach 1) I describe one approach to solving the problem - are there any other, better approaches, or is this the only way to solve it?
You can pass the whole dataset to fit for training. As you can see in the documentation, one of the possible values of the first parameter is:
A tf.data dataset. Should return a tuple of either (inputs, targets) or (inputs, targets, sample_weights).
So you just need to convert your dataset to that format (a tuple with input and target) and pass it to fit:
BATCH_SIZE = 32
builder = tfds.builder('mnist')
builder.download_and_prepare()
datasets = builder.as_dataset(batch_size=BATCH_SIZE)
raw_train_ds = datasets['train']
train_dataset_fit = raw_train_ds.map(
    lambda x: (tf.cast(x['image'], tf.float32) / 255.0, x['label']))
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)
One problem with this is that it does not support a validation_split parameter, but, as shown in this guide, tfds already gives you the functionality to split the data. So you would just need to get the test split, transform it as above, and pass it as validation_data to fit.
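A rough sketch of that idea, assuming the builder also provides a 'test' split and reusing the train_dataset_fit defined above, might look like this:
raw_test_ds = datasets['test']
test_dataset_fit = raw_test_ds.map(
    lambda x: (tf.cast(x['image'], tf.float32) / 255.0, x['label']))
history = model.fit(train_dataset_fit,
                    validation_data=test_dataset_fit,
                    epochs=NUM_EPOCHS)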
Approach 1
Thanks @jdehesa, I changed my code:
Load the dataset - in reality, it doesn't load data into memory until the first call to 'next' on the dataset iterator, and even then I think the iterator loads a portion of data (a batch) with a size equal to BATCH_SIZE:
raw_train_ds, raw_validation_ds = builder.as_dataset(split=["train[:90%]", "train[90%:]"], batch_size=BATCH_SIZE)
Collected all required transformations into one method:
def prepare_data(x):
    train_images, train_labels = x['image'], x['label']
    # TODO: resize image
    train_images = tf.cast(train_images, tf.float32) / 255.0
    # train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=NUM_CLASSES)
    train_labels = tf.one_hot(train_labels, NUM_CLASSES)
    return (train_images, train_labels)
Applied these transformations to each batch in the dataset using tf.data.Dataset.map:
train_dataset_fit = raw_train_ds.map(prepare_data)
And then fed this dataset into model.fit - as I understand it, model.fit will iterate through all batches in the dataset:
history = model.fit(train_dataset_fit, epochs=NUM_EPOCHS)

Get single random example from PyTorch DataLoader

How do I get a single random example from a PyTorch DataLoader?
If my DataLoader gives minibatches of multiple images and labels, how do I get a single random image and label?
Note that I don't want a single image and label per minibatch, I want a total of one example.
If your DataLoader is something like this:
test_loader = DataLoader(image_datasets['val'], batch_size=batch_size, shuffle=True)
it is giving you a batch of size batch_size, and you can pick out a single random example by directly indexing the batch:
for test_images, test_labels in test_loader:
    sample_image = test_images[0]  # Reshape them according to your needs.
    sample_label = test_labels[0]
    break  # only the first (random) batch is needed
Alternative solutions
You can use RandomSampler to obtain random samples.
Use a batch_size of 1 in your DataLoader.
Directly take samples from your Dataset like so:
mnist_test = datasets.MNIST('../MNIST/', train=False, transform=transform)
Now use this dataset to take samples:
for image, label in mnist_test:
    # do something with image and other attributes
(Probably the best) See here:
inputs, classes = next(iter(dataloader))
If you want to choose specific images from your Trainloader/Testloader, you should check out the Subset function from master:
Here's an example how to use it:
testset = ImageFolderWithPaths(root="path/to/your/Image_Data/Test/", transform=transform)
subset_indices = [0] # select your indices here as a list
subset = torch.utils.data.Subset(testset, subset_indices)
testloader_subset = torch.utils.data.DataLoader(subset, batch_size=1, num_workers=0, shuffle=False)
This way you can use exactly one image and label. However, you can of course use more than just one index in your subset_indices.
If you want to use a specific image from your data folder, you can use the dataset's samples attribute and build a dictionary to get the index of the image you want to use.
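For example, a torchvision ImageFolder-style dataset exposes a samples attribute of (path, class) pairs, so a rough sketch of such a lookup (the file path below is purely hypothetical) could be:
# build a lookup from file path to dataset index (assumes an ImageFolder-style dataset)
path_to_index = {path: i for i, (path, _) in enumerate(testset.samples)}
wanted_path = "path/to/your/Image_Data/Test/some_class/some_image.png"  # hypothetical
subset = torch.utils.data.Subset(testset, [path_to_index[wanted_path]])
testloader_single = torch.utils.data.DataLoader(subset, batch_size=1, shuffle=False)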
(This answer is to supplement Alternative 3 of @parthagar's answer)
Iterating through the dataset does not return "random" examples, you should instead use:
import numpy
# Recover the original `dataset` from the `dataloader`
dataset = dataloader.dataset
n_samples = len(dataset)
# Get a random sample
random_index = int(numpy.random.random() * n_samples)
single_example = dataset[random_index]
TL;DR:
The general form to get a single example from a DataLoader is:
sample = [x[0] for x in next(iter(trainloader))]
In particular, for the question asked, where minibatches of images and labels are returned:
image, label = [x[0] for x in next(iter(trainloader))]
Possibly interesting information:
To get a single minibatch from the DataLoader, use:
next(iter(trainloader))
When running something like for images, labels in dataloader:, what happens under the hood is that an iterator is created via iter(dataloader), and then the iterator's __next__() is called on each loop execution.
To get a single image from a DataLoader which returns images and labels, use:
image = next(iter(trainloader))[0][0]
This is the same as doing:
images, labels = next(iter(trainloader))
image = images[0]
Random sample from DataLoader
Assuming DataLoader(shuffle=True) was used in its construction, a single random example can be drawn from the DataLoader with:
example = next(iter(dataloader))[0]
Random sample from Dataset
If that is not the case, you can draw a single random example from the Dataset with:
idx = torch.randint(len(dataset), (1,)).item()
example = dataset[idx]
The key to getting a random sample is to set shuffle=True for the DataLoader, and the key to getting a single image is to set the batch size to 1.
Here is an example after loading the mnist dataset.
from torch.utils.data import DataLoader, Dataset, TensorDataset
bs = 1
train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)
for xb, yb in train_dl:
    print(xb.shape)
    x = xb.view(28, 28)
    print(x.shape)
    print(yb)
    break  # just once
from matplotlib import pyplot as plt
plt.imshow(x, cmap="gray")

tensorflow dataset tf.estimator.inputs.numpy_input_fn

I'm writing code for reading images and labels from disk in TensorFlow and then trying to call tf.estimator.inputs.numpy_input_fn. How can I pass the whole dataset instead of a single image? My code looks like:
filenames = tf.constant(filenames)
labels = tf.constant(labels)
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
dataset_batched = dataset.batch(10)
iterator = dataset_batched.make_one_shot_iterator()
features, labels = iterator.get_next()
with tf.Session() as sess:
    print(dataset_batched)
    print(np.shape(sess.run(features)))
    print(np.shape(sess.run(labels)))
    mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_mk, model_dir=dir)
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(sess.run(features))},
        y=np.array(sess.run(labels)),
        batch_size=1,
        num_epochs=None,
        shuffle=False)
    mnist_classifier.train(input_fn=train_input_fn, steps=1)
And my question is: how can I pass the dataset here, x={"x": np.array(sess.run(features))}?
There is no need/use for numpy_input_fn here. You should wrap the code at the top into a function (say, my_input_fn) that returns iterator.get_next(), and then pass input_fn=my_input_fn into the train call. This would pass the full dataset to the training code in batches of 10.
numpy_input_fn is for when you have the full dataset available in an array already and want a quick way to do batching/shuffling/repeating etc.
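A rough sketch of that refactoring, reusing the names from the question and assuming the model_fn reads its input from features["x"] (as the numpy_input_fn call implies), could look like this:
def my_input_fn():
    # rebuild the tf.data pipeline inside the input function (TF 1.x style)
    filenames_t = tf.constant(filenames)
    labels_t = tf.constant(labels)
    dataset = tf.data.Dataset.from_tensor_slices((filenames_t, labels_t))
    dataset = dataset.map(_parse_function)
    dataset_batched = dataset.batch(10)
    iterator = dataset_batched.make_one_shot_iterator()
    features, batch_labels = iterator.get_next()
    return {"x": features}, batch_labels

mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_mk, model_dir=dir)
mnist_classifier.train(input_fn=my_input_fn, steps=1)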
