How to extract a specific digit from the MNIST dataset with dataloader? - python

I am feeding the MNIST dataset to train my neural network in the following manner:
indices = torch.arange(60000)
dataset = datasets.MNIST(root="dataset/", transform=transforms, download=True)
datasetsmall = data_utils.Subset(dataset, indices)
loader = DataLoader(datasetsmall, batch_size=batch_size, shuffle=True)
However, since training takes a huge amount of time to complete, I have decided to train the model with only a specific digit from the MNIST dataset, for example the digit 4. How can I extract just the digit 4 and feed it to my neural network in the same way? The loop to train the neural network looks like
for batch_idx, (real, _) in enumerate(loader):
Now I want only the digit 4 in the loader. How should I proceed in that case?

Does this code solve your problem?
import torch
from torchvision import datasets
from torch.utils.data import TensorDataset, DataLoader
from torchvision.transforms import ToTensor

cls = 4  # needed class
batch_size = 32

dataset = datasets.MNIST(root="dataset/", download=True, transform=ToTensor())
# keep only the samples whose label equals cls
dataset = list(filter(lambda i: i[1] == cls, dataset))
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

s = 0
for i in loader:
    s += 1
print(f'We\'ve got {s} batches with batch_size {batch_size} only for class {cls}')
# print(i)  # uncomment this line if you want to examine the last batch yourself
Result:
We've got 183 batches with batch_size 32 only for class 4
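A faster alternative (a sketch, not from the original answer): instead of materializing the filtered samples in a Python list, which decodes every image up front, you can select indices from the label tensor and wrap them in a Subset. The dataset.targets attribute is the label tensor torchvision's MNIST dataset exposes.
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets
from torchvision.transforms import ToTensor

cls = 4
batch_size = 32

dataset = datasets.MNIST(root="dataset/", download=True, transform=ToTensor())
# pick the indices of all samples labelled cls without loading any images
indices = torch.where(dataset.targets == cls)[0].tolist()
loader = DataLoader(Subset(dataset, indices), batch_size=batch_size, shuffle=True)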

Related

How to deal with DataCollator and DataLoaders in Huggingface?

I have issues combining a DataLoader and DataCollator. The following code with DataCollatorWithPadding results in a ValueError ("Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.") when I want to iterate through the batches.
from torch.utils.data.dataloader import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16,
                              collate_fn=data_collator)
eval_dataloader = DataLoader(eval_dataset, batch_size=16, collate_fn=data_collator)

for epoch in range(2):
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
However, I found another approach where I changed the data collator to lambda x: x. That gives me a TypeError: DistilBertForSequenceClassification object argument after ** must be a mapping, not list
from torch.utils.data.dataloader import DataLoader

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16, collate_fn=lambda x: x)
eval_dataloader = DataLoader(eval_dataset, batch_size=16, collate_fn=lambda x: x)

for epoch in range(2):
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
For reproducibility and for the rest of the code I provide you a Jupyter Notebook on Google Colab. You find the errors at the bottom of the notebook.
Link to Colab Notebook
If you take a look at the train_dataset object from your notebook:
print(train_dataset)
Output:
Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 25000
})
DataCollatorWithPadding doesn't know how to pad the text column because it's just a string.
Since you've already tokenized the dataset, you can simply remove the text column like so:
train_dataset = train_dataset.remove_columns("text")
The other three columns are all tensors and so can be padded by the data collator. Your first training loop will then run as expected.
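A minimal sketch of the fix in context, assuming the tokenizer and column names from the notebook:
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# drop the raw string column; the remaining columns can all be turned into tensors
train_dataset = train_dataset.remove_columns("text")
eval_dataset = eval_dataset.remove_columns("text")

data_collator = DataCollatorWithPadding(tokenizer)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16,
                              collate_fn=data_collator)

batch = next(iter(train_dataloader))
print({k: v.shape for k, v in batch.items()})  # input_ids and attention_mask padded to a common length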

What causes tensorflow keras Conv1D to only run the 1st epoch?

Currently I am using TensorFlow to create a neural network with a 1D convolutional layer and a Dense layer to predict a single output value. The input array for the neural network is an array of 1500 samples; each sample is an array of 27x13 values.
I started training in the same manner as I did without the 1D conv layer, but the training stopped during the first epoch without warning.
I found that multiprocessing might be the cause and that I should therefore turn multiprocessing off, as discussed here: https://github.com/stellargraph/stellargraph/issues/1006
Basically, that means adding this argument to my Keras model.fit() call:
use_multiprocessing=False
That did not change anything. I then found that I should probably use a tf.data.Dataset to bypass the multiprocessing issues, according to https://github.com/stellargraph/stellargraph/issues/1206 ("Replace tf.keras.Sequence objects with tf.data.Dataset #1206").
After struggling with the difference between tf.data.Dataset.from_tensors and tf.data.Dataset.from_tensor_slices, I found the following code to start executing the model.fit block again. As you might have guessed, it still stops running after the first epoch:
main loop started
Epoch 1/5
Press any key to continue . . .
Can someone pinpoint what is causing the program to halt?
This is my code:
import random
import numpy as np
from keras import backend as K
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from keras.models import load_model
from keras.callbacks import CSVLogger

EPOCHS = 5
BATCH_SIZE = 16

def tfdata_generator(x, y, is_training, batch_size=BATCH_SIZE):
    '''Construct a data generator using `tf.Dataset`.'''
    dataset = tf.data.Dataset.from_tensor_slices((x, y))
    if is_training:
        dataset = dataset.shuffle(1500)  # depends on sample size
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.repeat()
    dataset = dataset.prefetch(1)
    return dataset

def main():
    print("main loop started")
    X_train = np.random.randn(1500, 27, 13)
    Y_train = np.random.randn(1500, 1)
    training_set = tfdata_generator(X_train, Y_train, is_training=True)
    data = np.random.randn(1500, 27, 13), Y_train
    training_set = tf.data.Dataset.from_tensors((X_train, Y_train))

    logstring = "C:\Documents\Conv1D"
    csv_logger = CSVLogger((logstring + ".csv"), append=True, separator=';')
    early_stopper = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=20, min_delta=0.00001)

    model = keras.Sequential()
    model.add(layers.Conv1D(
        filters=10,
        kernel_size=9,
        strides=3,
        padding="valid"))
    model.add(layers.Flatten())
    model.add(layers.Dense(70, activation='relu', name="layer2"))
    model.add(layers.Dense(1))

    optimizer = keras.optimizers.Adam(learning_rate=0.0001)
    model.compile(optimizer=optimizer, loss="mean_squared_error")
    # WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
    model.fit(training_set,
              epochs=EPOCHS,
              batch_size=BATCH_SIZE,
              verbose=2,
              # validation_split=0.2,
              use_multiprocessing=False)
    model.summary()

    modelstring = "C:\Documents\Conv1D_finishedmodel"
    model.save(modelstring, overwrite=True)
    model = load_model(modelstring)

main()
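As an aside on the two constructors mentioned above (an illustrative sketch, not part of the original question): from_tensor_slices slices the arrays along the first dimension into one element per sample, while from_tensors wraps the whole arrays as a single element, so an epoch over the latter is a single step.
import numpy as np
import tensorflow as tf

x = np.random.randn(1500, 27, 13).astype("float32")
y = np.random.randn(1500, 1).astype("float32")

# one element per sample: batching by 16 yields ceil(1500 / 16) = 94 batches per epoch
sliced = tf.data.Dataset.from_tensor_slices((x, y)).batch(16)
print(sliced.element_spec)            # shapes (None, 27, 13) and (None, 1)
print(sliced.cardinality().numpy())   # 94

# a single element holding the full arrays: the whole dataset is consumed in one step
whole = tf.data.Dataset.from_tensors((x, y))
print(whole.element_spec)             # shapes (1500, 27, 13) and (1500, 1)
print(whole.cardinality().numpy())    # 1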

How to save all the generated images in a folder in PyTorch

I am trying to use data augmentation with pytorch. I want to save all the generated images in a folder (target_dir) with different numbering based on the batch index.
Here is my code. I am using epoch=100 and batch_size=128.
import os
for batch_idx in range(BATCH_SIZE):
    torchvision.utils.save_image(img_grid_fake, f"C:/UserspythonProjectgenerated_image/Fake_image%{batch_idx}d.png", global_step=step)
but I am only getting the last 128 generated images; the previously generated images get overwritten when the next epoch runs.
You need to save the images with f"Fake_image-{epoch}-{batch_idx}.png" so that both epoch and batch_idx are used in naming the files.
import os
import torch
from torch import nn
import torchvision
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor

target_dir = r"C:/Users/PycharmProjects/pythonProject/generated/generated_image/"
EPOCHS = 10
BATCH_SIZE = 64
GRID_SIZE = 9  # 9 images in each grid
NUM_ROWS = 3   # sqrt(GRID_SIZE)
# if you want all the images in a batch to make the image-grid,
# set GRID_SIZE = BATCH_SIZE

# the transform belongs to the dataset, not the DataLoader
train_dataset = YourFakeImageDataset(transform=ToTensor())
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    for batch_idx, (X, y) in enumerate(train_dataloader):
        # assume X is the fake-image returned by the dataloader
        # and y is some target value for X, also returned by the dataloader
        # ... do something with your images here
        # B, C, H, W = X.shape
        img_grid_fake = torchvision.utils.make_grid(X[:GRID_SIZE, ...], nrow=NUM_ROWS)
        filepath = os.path.join(target_dir, f"Fake_image-{epoch}-{batch_idx}.png")
        torchvision.utils.save_image(img_grid_fake, filepath)
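For completeness, a minimal placeholder for YourFakeImageDataset so the snippet above can run on its own. This class is hypothetical and only stands in for whatever actually produces the fake images:
import torch
from torch.utils.data import Dataset

class YourFakeImageDataset(Dataset):
    """Hypothetical stand-in that returns random 'fake images' and dummy targets."""
    def __init__(self, num_samples=640, transform=None):
        self.num_samples = num_samples
        self.transform = transform  # accepted for API symmetry; not needed for random tensors

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        img = torch.rand(3, 64, 64)  # C, H, W with values in [0, 1]
        target = 0
        return img, target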
NOTE: I cannot answer you properly, as your question does not specify a lot of details clearly (some of them are asked by others in the comments).
If you are making a fake-image-grid, how are you doing that? With torchvision.utils.make_grid()?
References
torchvision.utils.make_grid()
Visualizing a grid of images

Fast Data Generator for Training GoogLeNet model

I am trying to train GoogLeNet from scratch in Keras. I have built the network architecture, and it is ready to train. To train GoogLeNet with its auxiliary outputs, the data generator should provide three output labels. I wrote my custom data generator using tf.keras.utils.Sequence.
My custom generator is:
from skimage.transform import resize
from skimage.io import imread
import numpy as np
import math
from tensorflow.keras.utils import Sequence


class GoogLeNetDatasetGenerator(Sequence):
    def __init__(self, X_train_path, y_train, batch_size):
        """
        Initialize the GoogLeNet dataset generator.
        :param X_train_path: Path of train images
        :param y_train: Labels of train images
        :param batch_size:
        """
        self.X_train_path = X_train_path
        self.y_train = y_train
        self.batch_size = batch_size
        self.indexes = np.arange(len(self.X_train_path))
        np.random.shuffle(self.indexes)

    def __len__(self):
        """
        Denotes the number of batches per epoch
        :return:
        """
        return math.ceil(len(self.X_train_path) / self.batch_size)

    def __getitem__(self, index):
        """
        Get batch indexes from shuffled indexes
        :param index:
        :return:
        """
        indexes = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]
        X_batch_names = [self.X_train_path[i] for i in indexes]
        y_batch_naive = self.y_train[indexes]

        X_batch = np.array([resize(imread(file_name), (224, 224)) for file_name in X_batch_names],
                           dtype='float32')
        y_batch = [y_batch_naive, y_batch_naive, y_batch_naive]
        return X_batch, y_batch

    def on_epoch_end(self):
        """
        Updates indexes after each epoch
        :return:
        """
        self.indexes = np.arange(len(self.X_train_path))
        np.random.shuffle(self.indexes)
Also, I compile and train the model with the following code:
# Compile model
model.compile(loss=[CategoricalCrossentropy(), CategoricalCrossentropy(), CategoricalCrossentropy()],
              loss_weights=[1, 0.3, 0.3], optimizer='adam',
              metrics=['accuracy'])

# Train model
history = model.fit(train_dataset, validation_data=test_dataset, epochs=100)
While using the GPU version of TensorFlow, loading images in the data generator is time-consuming and it slows down the training process. Are there any suggestions or other solutions for speeding up the data loading?
P.S.
I searched Stack Overflow questions such as this one, but I did not find any ideas.
I found another, faster solution: you can use tf.data.Dataset. In the first step, I list all the training image paths. Using the map method helped me read the images and properly configure the corresponding labels. Here is my sample code to load an image with the ternary label.
image_filenames = tf.constant(image_list)
slices_dataset = tf.data.Dataset.from_tensor_slices(image_filenames)
slices_labels = tf.data.Dataset.from_tensor_slices(label_list)

image_dataset = slices_dataset.map(map_func=process_image)
label_dataset = slices_labels.map(map_func=process_label)

x_dataset = image_dataset.shuffle(buffer_size=Cfg.BUFFER_SIZE, seed=0).\
    batch(batch_size=Cfg.BATCH_SIZE)
y_dataset = label_dataset.shuffle(buffer_size=Cfg.BUFFER_SIZE, seed=0).\
    batch(batch_size=Cfg.BATCH_SIZE)

dataset = tf.data.Dataset.zip((x_dataset, y_dataset))
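The map functions are not shown above; here is a sketch of what process_image and process_label might look like (the JPEG decoding, the 224x224 size, and duplicating the label for the two auxiliary heads are assumptions on my part):
import tensorflow as tf

IMG_SIZE = 224  # assumed GoogLeNet input size

def process_image(file_name):
    # read and decode one image file, resize it, and scale pixel values to [0, 1]
    raw = tf.io.read_file(file_name)
    img = tf.io.decode_jpeg(raw, channels=3)
    img = tf.image.resize(img, (IMG_SIZE, IMG_SIZE))
    return img / 255.0

def process_label(label):
    # the main output plus the two auxiliary outputs all receive the same label
    label = tf.cast(label, tf.float32)
    return label, label, label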

Ensure that all samples in a batch have the same shape in pytorch dataloader

My task is to do multi-label classification on a custom dataset with PyTorch and BERT. My data contains about 1500 samples. The number of words can vary between 1000 and 50k words. Because BERT can only handle a maximum sequence length of 512, I am using a sliding window approach on my data. Please note that a data sample can have several sentences.
For reference, I'm working with the example notebooks here and from Hugging Face.
Here is a minimal version of my script:
import math

import pandas as pd
import torch
from torch import cuda
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

import transformers
from transformers import BertTokenizer, BertModel, BertConfig, AutoTokenizer

device = 'cuda' if cuda.is_available() else 'cpu'

MAX_LEN = 400
STRIDE = 20
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
EPOCHS = 1
LEARNING_RATE = 1e-05

model_checkpoint = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, local_files_only=True)
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)


class CustomDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len, stride):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.text
        self.targets = self.data.labels
        self.max_len = max_len
        self.stride = stride

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

        inputs = self.tokenizer(
            text,
            None,
            max_length=MAX_LEN,
            stride=STRIDE,
            padding='max_length',
            truncation='only_first',
            return_overflowing_tokens=True,
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }
I think the sliding window is working, because if I run [len(x) for x in inputs["input_ids"]] I get one entry per window for my paragraph/text.
# Creating the dataset and dataloader for the neural network
train_size = 0.8
train_dataset = training_frame.sample(frac=train_size, random_state=200)
test_dataset = training_frame.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print("FULL Dataset: {}".format(training_frame.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = CustomDataset(train_dataset, tokenizer, MAX_LEN, STRIDE)
testing_set = CustomDataset(test_dataset, tokenizer, MAX_LEN, STRIDE)

train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': False,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
               'shuffle': True,
               'num_workers': 0
               }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)
Up to here the script runs without any error, but if I try to iterate over training_loader like here:
train_iter = iter(training_loader)
print(type(train_iter))
text, labels = train_iter.next()
print(text.size())
print(labels.size())
I get the following error:
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 133 and 75 in dimension 1 at /opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/TH/generic/THTensor.cpp:711
Process finished with exit code 1
In this question André mentions that the loaded batches have different shapes and that this is how the error occurs. He suggests setting batch_size = 1. However, I want to use the batch_size defined in my script.
I think the sliding window causes the error, because the number of input_ids windows can vary per sample in my batch, since the total length of each text is different.
How can I ensure that the data fed to the network always has the same shape?
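To illustrate the diagnosis in the question (a sketch, not from the original post): with return_overflowing_tokens=True each sample becomes a tensor of shape [num_windows, 400], and num_windows differs from text to text, so PyTorch's default collate function cannot stack the samples. One possible direction is a custom collate_fn that pads the window dimension; the helper below is hypothetical and only handles the 'ids' key.
import torch
from torch.utils.data.dataloader import default_collate

# two hypothetical samples: 3 windows vs. 5 windows of 400 tokens each
sample_a = {'ids': torch.zeros(3, 400, dtype=torch.long)}
sample_b = {'ids': torch.zeros(5, 400, dtype=torch.long)}

try:
    default_collate([sample_a, sample_b])
except RuntimeError as err:
    print(err)  # size mismatch, analogous to the error above

def pad_windows_collate(batch):
    # pad every sample's window dimension to the largest window count in the batch
    max_windows = max(item['ids'].shape[0] for item in batch)
    padded = []
    for item in batch:
        missing = max_windows - item['ids'].shape[0]
        pad = torch.zeros(missing, item['ids'].shape[1], dtype=item['ids'].dtype)
        padded.append(torch.cat([item['ids'], pad], dim=0))
    return {'ids': torch.stack(padded)}

batch = pad_windows_collate([sample_a, sample_b])
print(batch['ids'].shape)  # torch.Size([2, 5, 400])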
